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CLASSIFICATION OF BREAST CANCER PATIENTS USING A 
COMBINATION OF CLINICAL CRITERIA AND INFORMATIVE GENESETS 



[0001] This application claims the benefit under 35 U.S.C. § 1 19(e) of U.S. Provisional 
Patent Application No. 60/650,401, filed on February 4, 2 0 05, U.S. Provisional Patent 
Application No. 60/604,076, filed on August 24, 2004, and U.S. Provisional Patent 
Application No. 60/550,810, filed on March 5, 2004, each of which is incorporated by 
reference herein in its entirety. 

1 . FIELD OF THE INVENTION 

[0002] The present invention relates to the use of both phenotypic and genotypic aspects of 
a condition, such as a disease, in order to identify discrete subsets of patients for which 
specific sets of informative genes are then identified. The invention also relates to the 
classification of individuals, such as breast cancer patients, into a subset of the condition on 
the basis of clinical parameters and the status of markers, for example, of genes expression 
patterns, and the prognosis of those individuals on the basis of markers informative for 
prognosis within the subset of the condition. The invention also relates to methods of 
determining a course of treatment or therapy to an individual having, or suspected of having, 
a condition, such as breast cancer. The invention further relates to methods of structuring a 
clinical trial, particularly using five breast cancer-specific patient subsets and prognosis- 
informative genes for each, and of identifying patient populations for clinical trials or for 
other condition-related, for example, breast cancer-related, research. Finally, the invention 
relates to computer implementations of the above methods. 

2. BACKGROUND OF THE INVENTION 

[0003] The increased number of cancer cases reported in the United States, and, indeed, 
around the world, is a major concern. Currently there are only a handful of treatments 
available for specific types of cancer, and these provide no guarantee of success. In order to 
be most effective, these treatments require not only an early detection of the malignancy, but 
a reliable assessment of the severity of the malignancy. 

[0004] The incidence of breast cancer, a leading cause of death in women, has been 
gradually increasing in the United States over the last thirty years. Its cumulative risk is 
relatively high; 1 in 8 women are expected to develop some type of breast cancer by age 85 in 
the United States. In fact, breast cancer is the most common cancer in women and the second 
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most common cause of cancer death in the United States. In 1997, it was estimated that 
181,000 new cases were reported in the U.S., and that 44,000 people would die of breast 
cancer (Parker et al, CA Cancer J. Clin. 47:5-27 (1997); Chu et al, J. Nat. Cancer Inst. 
88:1571-1579 (1996)). While mechanism of tumorigenesis for most breast carcinomas is 
largely unknown, there are genetic factors that can predispose some women to developing 
breast cancer (Miki et al, Science, 266:66-71(1994)). 

[0005] Sporadic tumors, those not currently associated with a known germline mutation, 
constitute the majority of breast cancers. It is also likely that other, non-genetic factors also 
have a significant effect on the etiology of the disease. Regardless of the cancer's origin, 
breast cancer morbidity and mortality increases significantly if it is not detected early in its 
progression. Thus, considerable effort has focused on the early detection of cellular 
transformation and tumor formation in breast tissue. 

[0006] A marker-based approach to tumor identification and characterization promises 
improved diagnostic and prognostic reliability. Typically, the diagnosis of breast cancer 
requires histopathological proof of the presence of the tumor. In addition to diagnosis, 
histopathological examinations also provide information about prognosis and selection of 
treatment regimens. Prognosis may also be established based upon clinical parameters such 
as tumor size, tumor grade, the age of the patient, and lymph node metastasis. 
[0007] Diagnosis and/or prognosis may be determined to varying degrees of effectiveness 
by direct examination of the outside of the breast, or through mammography or other X-ray 
imaging methods (Jatoi, Am. J. Surg. 177:518-524 (1999)). The latter approach is not 
without considerable cost, however. Every time a mammogram is taken, the patient incurs a 
small risk of having a breast tumor induced by the ionizing properties of the radiation used 
during the test. In addition, the process is expensive and the subjective interpretations of a 
technician can lead to imprecision. For example, one study showed major clinical 
disagreements for about one-third of a set of mammograms that were interpreted individually 
by a surveyed group of radiologists. Moreover, many women find that undergoing a 
mammogram is a painful experience. Accordingly, the National Cancer Institute has not 
recommended mammograms for women under fifty years of age, since this group is not as 
likely to develop breast cancers as are older women. It is compelling to note, however, that 
while only about 22% of breast cancers occur in women under fifty, data suggests that breast 
cancer is more aggressive in pre-menopausal women. 

[0008] In clinical practice, accurate diagnosis of various subtypes of breast cancer is 
important because treatment options, prognosis, and the likelihood of therapeutic response all 
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vary broadly depending on the diagnosis. Accurate prognosis, or determination of distant 
metastasis-free survival could allow the oncologist to tailor the administration of adjuvant 
chemotherapy, with women having poorer prognoses being given the most aggressive 
treatment. Furthermore, accurate prediction of poor prognosis would greatly impact clinical 
trials for new breast cancer therapies, because potential study patients could then be stratified 
according to prognosis. Trials could then be limited to patients having poor prognosis, in 
turn making it easier to discern if an experimental therapy is efficacious. 
[0009] To date, no set of satisfactory predictors for prognosis based on the clinical 
information alone has been identified. Many have observed that the ER status has a dominant 
signature in the breast tumor gene expression profiling. See West et al, Proc. Natl. Acad. 
Set U.S.A. 98:1 1462 (2001); van 't Veer et a!., Nature 415:530 (2002); Sorlie et al, Proc. 
Natl. Acad Set U.S.A. 100:8418 (2003); Perou et al, Nature 406:747 (2000); Gruvberger et 
al, Cancer Res. 61:5979 (2001); Sotiriou et al, Proc. Natl Acad. Sci. U.S.A. 100:10393 
(2003). It is generally accepted that there is some relationship between patient survival and 
ER status, van de Vij ver et al. , N Engl. J. Med. 347: 1 999 (2002); Surowiak et al Folia 
Histochem. Cytobiol 39:143 (2001); Pichon et al, Br. J. Cancer 73:1545 (1996); Collett et 
al, J. Clin. Pathol. 49:920 (1996). BRCA1 mutations are related to the familial cancer 
susceptibility. Biesecker et al, JAMA 269:1970 (1993); Easton et al, Cancer Surv. 18:95 
(1993). Age is also considered to be a prognosis factor since young cancer patients tend to 
have poor tumors. Maggard et al, J. Surg. Res. 113:109(2003). Lymph node status is a 
factor in deciding the treatment. Eifel et al, J. Natl. Cancer Inst. 93:979 (2001). 
[0010] The discovery and characterization of BRCA1 and BRCA2 has recently expanded 
our knowledge of genetic factors which can contribute to familial breast cancer. Germ-line 
mutations within these two loci are associated with a 50 to 85% lifetime risk of breast and/or 
ovarian cancer (Casey, Curr. Opin. Oncol. 9:88-93 (1997); Marcus et al, Cancer 11:691- 
709 (1996)). Only about 5% to 10% of breast cancers, however, are associated with breast 
cancer susceptibility genes, BRCA1 and BRCA2. The cumulative lifetime risk of breast 
cancer for women who carry the mutant BRCA1 is predicted to be approximately 92%, while 
the cumulative lifetime risk for the non-carrier majority is estimated to be approximately 
10%. BRCA1 is a tumor suppressor gene that is involved in DNA repair and cell cycle 
control, which are both important for the maintenance of genomic stability. More than 90% 
of all mutations reported so far result in a premature truncation of the protein product with 
abnormal or abolished function. The histology of breast cancer in BRCA1 mutation carriers 
differs from that in sporadic cases, but mutation analysis is the only way to find the carrier. 
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Like BRCA1, BRCA2 is involved in the development of breast cancer, and like BRCA1 plays 
a role in DNA repair. However, unlike BRCA1, it is not involved in ovarian cancer. 
[001 1] Other genes have been linked to breast cancer, for example c-erb-2 {HER2) and p53 
(Beenken et ah, Ann. Surg. 233(5):630-638 (2001). Overexpression of c-erb-2 (HER2) and 
p53 have been correlated with poor prognosis (Rudolph et ah, Hum. Pathol. 32(3):31 1-319 
(2001), as has been aberrant expression products of mdm2 (Lukas et ah, Cancer Res. 
61(7):3212-3219 (2001) and cyclinl and p27 (Porter & Roberts, International Publication 
WO98/33450, published August 6, 1998). 

[0012] The detection of BRCA1 or BRCA2 mutations represents a step towards the design of 
therapies to better control and prevent the appearance of these tumors. Recently, many 
studies have used gene expression profiling to analyze various cancers, and those studies 
have provided new diagnosis and prognosis information in the molecular level. See 
Zajchowski et ah, "Identification of Gene Expression Profiled that Predict the Aggressive 
Behavior of Breast Cancer Cells," Cancer Res. 61 :5168 (2001); West et ah, "Predicting the 
Clinical Status of Human Breast Cancer by Using Gene Expression Profiles," Proc. Natl. 
Acad. Sci. U.S.A. 98:1 1462 (2001); van 't Veer et ah, "Gene Expression Profiling Predicts the 
Outcome of Breast Cancer," Nature 415:530 (2002); Roberts et ah, "Diagnosis and Prognosis 
of Breast Cancer Patients," WO 02/103320; Sorlie et ah, Proc. Natl. Acad. Sci. U.S.A. 
100:8418 (2003); Perou et ah, Nature 406:747 (2000); Khan et ah, Cancer Res 58, 5009 
(1998); Golub et ah, Science 286, 531 (1999); DeRisi et ah, Nat. Genet. 14:457 (1996); 
Alizadeh et ah, Nature 403, 503 (2000). Methods for the identification of informative 
genesets for various cancers have also been described. See Roberts et ah, "Diagnosis and 
Prognosis of Breast Cancer Patients," WO 02/103320; Golub et ah, United States Patent No. 
6,647,341. 

[0013] Genesets have been identified that are informative for differentiating individuals 
having, or suspected of having, breast cancer based on estrogen receptor (ER) status, or 
BRCA1 mutation vs. sporadic (i.e., other than BRCAJ-type) mutation status. See Roberts et 
ah, WO 02/103320; van't Veer et ah, Nature 415:530 (2001). Genesets have also been 
identified that enable the classification of sporadic tumor-type individuals as those who will 
likely have no metastases within five years of initial diagnosis (i.e., individuals with a good 
prognosis) or those who will likely have a metastasis within five years of initial diagnosis 
(i.e., those having a poor prognosis). Roberts, supra; van't Veer, supra. 
[0014] Roberts et ah WO 02/103320 describes a 70-gene set, useful for the prognosis of 
breast cancer, which outperformed clinical measures of prognosis, and which showed good 
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potential in selecting good outcome patients, thereby avoiding over-treatment, van de Vijver 
et al, N. Engl J. Med. 347:1999 (2002). The expression of genes with most predictive value, 
however, were not homogeneous among poor patients, suggesting the need for improvement. 
[0015] Although the patterns of gene expression as described in Roberts et al were 
correlated with existing clinical indicators such as estrogen receptor and BRCA1 status, 
clinical measures were not incorporated. Furthermore, although the poor-outcome group in 
particular showed heterogeneity in expression pattern, the best classifier decision rule found 
during these studies was a fairly simple one based on the similarity of a patient profile to the 
average profile of a good-outcome training group. 

[0016] It is evident that breast cancer is the result of more than one type of molecular event. 
Likewise, a variety of other conditions, such as other cancers; non-cancer diseases such as 
diabetes, autoimmune or neurodegenerative disorders, obesity; etc., are also the result of 
more than one molecular event. Moreover, an individual's response to exposure to particular 
environmental conditions, for example, exposure to natural or man-made agents, such as 
toxins, pollutants, drugs, food additives, etc., likely result from more than one molecular 
event. Thus, there exists a need for improved prognostic methods so that appropriate courses 
of prophylaxis and/or therapy may be provided. Genesets having improved prognostic power 
can be identified by first identifying discrete subsets of individuals based on genotypic or 
phenotypic characteristics relevant to the disease or condition, and then identifying genesets 
informative for prognosis within those subsets of patients. Individuals having the condition, 
or who are suspected of having the condition, such as breast cancer, would then be provided 
therapies appropriate to the molecular mechanisms underlying the condition. The present 
invention provides such methods for breast cancer, and for other cancers, diseases or 
conditions. 

3 . SUMMARY OF THE INVENTION 

[00 1 7] The present invention provides methods of identifying relevant subsets of 
conditions, and the identification of markers relevant to those subsets, for example, for 
prognosis of individuals classifiable into one of those subsets. The invention further provides 
sets of markers useful for the prognosis of individuals having breast cancer, wherein those 
patients have been classified according to one or more characteristics of breast cancer. 
[00 1 8] Thus, the present invention provides a method of identifying a set of informative 
genes or markers for a condition comprising a plurality of phenotypic or genotypic 
characteristics, comprising: (a) classifying each of a plurality of samples or individuals on the 
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basis of one or more phenotypic or genotypic characteristics of said condition into a plurality 
of first classes; and (b) identifying within each of said first classes a first set of genes or 
markers informative for said condition, wherein said first set of genes or markers within each 
of said first classes is unique to said class relative to other first classes. In a specific 
embodiment, this method further comprises additionally classifying into a plurality of second 
classes said samples or individuals in at least one of said first classes on the basis of a 
phenotypic or genotypic characteristic different that that used in said classifying step (a); and 
identifying within at least one of said second classes a second set of informative genes or 
markers, wherein said second set of informative genes or markers within each of said second 
classes is unique to said second class relative to other first and second classes. 
[001 9] The invention further provides a method of identifying a set of informative genes or 
markers for a condition comprising a plurality of phenotypic or genotypic characteristics, 
comprising: (a) classifying each of a plurality of samples or individuals on the basis of one or 
more phenotypic or genotypic characteristics into a plurality of first classes; (b) classifying at 
least one of said first classes into a plurality of second classes on the basis of phenotypic or 
genotypic characteristic different than that used in said classifying step (a); and (c) 
identifying within at least one of said first classes or said second classes a set of genes or 
markers informative for said condition, wherein said second set of genes or markers is unique 
to said class relative to other first and second classes. 

[0020] The invention further provides a method of identifying a set of informative genes or 
markers for a condition comprising a plurality of phenotypic or genotypic characteristics, 
comprising: (a) selecting a first characteristic from said plurality of phenotypic or genotypic 
characteristics; (b) identifying at least two first condition classes differentiable by said first 
characteristic; (c) selecting a plurality of individuals classifiable into at least one of said first 
condition classes; and (d) identifying in samples derived from each of said plurality of 
individuals a set of genes or markers informative for said condition within said at least one of 
said first condition classes. 

[002 1] The invention further provides a method of classifying an individual with a 
condition as having a good prognosis or a poor prognosis, comprising: (a) classifying said 
individual into one of a plurality of patient classes, said patient classes being differentiated by 
one or more phenotypic, genotypic or clinical characteristics of said condition; (b) 
determining the level of expression of a plurality of genes or their encoded proteins in a cell 
sample taken from the individual relative to a control, said plurality of genes or their encoded 
proteins comprising genes or their encoded proteins informative for prognosis of the patient 
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class into which said individual is classified; and (c) classifying said individual as having a 
good prognosis or a poor prognosis on the basis of said level of expression. In a specific 
embodiment, said condition is cancer, said good prognosis is the non-occurrence of 
metastases within five years of initial diagnosis, and said poor prognosis is the occurrence of 
metastases within five years of initial diagnosis. In a more specific embodiment, said cancer 
is breast cancer. In another specific embodiment, said control is the average level of 
expression of each of said plurality of genes or their encoded proteins across a plurality of 
samples derived from individuals identified as having a poor prognosis. In a more specific 
embodiment, said classifying step (c) is carried out by a method comprising comparing the 
level of expression of each of said plurality of genes or their encoded proteins to said average 
level of expression of each corresponding gene or its encoded protein in said control, and 
classifying said individual as having a poor prognosis if said level of expression correlates 
with said average level of expression of each of said genes or their encoded proteins in said 
control more strongly than would be expected by chance. In another specific embodiment, 
said control is the average level of expression of each of said plurality of genes or their 
encoded proteins across a plurality of samples derived from individuals identified as having a 
good prognosis. In a more specific embodiment, said classifying in step (c) is carried out by 
a method comprising comparing the level expression of each of said plurality of genes or 
their encoded proteins to said average level of expression of each corresponding gene or its 
encoded protein in said control, and classifying said individual as having a good prognosis if 
said level of expression correlates with said average level of expression of each of said genes 
or their encoded proteins in said control more strongly than would be expected by chance. In 
another specific embodiment, said plurality of patient classes comprises ER~, BRCAI 
individuals; ER~, sporadic individuals; ER+, ER/AGE high individuals; ER+, ER/AGE low, 
LN+ individuals; and ER+, ER/AGE low, LN~ individuals. 

[0022] The invention further provides a method of classifying a breast cancer patient as 
having a good prognosis or a poor prognosis comprising: (a) classifying said breast cancer 
patient as ER~ BRCAI; ER~ sporadic; ER+, ER/AGE high; ER+, ER/AGE low, LN+; or 
ER+, ER/AGE low, LN~; (b) determining the level of expression of a first plurality of genes 
in a cell sample taken from said breast cancer patient relative to a control, said first plurality 
of genes comprising two of the genes corresponding to the markers in Table 1 if said breast 
cancer patient is classified as ER~~, BRCAI; in Table 2 if said breast cancer patient is 
classified as ER~ sporadic; in Table 3 if said breast cancer patient is classified as ER+, 
ER/AGE high; in Table 4 if said breast cancer patient is classified as ER+. ER/AGE low, 
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LN+; or in Table 5 if said breast cancer patient is classified as ER+, ER/AGE low, LN~; and 
(c) classifying said breast cancer patient as having a good prognosis or a poor prognosis on 
the basis of the level of expression of said first plurality of genes, wherein said breast cancer 
patient is "ER/AGE high" if the ratio of the logio(ratio) of ER gene expression to age exceeds 
a predetermined value, and "ER/AGE low" if the ratio of the log i0 (ratio) of ER gene 
expression to age does not exceed said predetermined value. In a specific embodiment, said 
control is the average level of expression of each of said plurality of genes in a plurality of 
samples derived from ER", BRCA1 individuals, if said breast cancer patient is ER", BRCA1; 
the average level of expression of each of said plurality of genes in a plurality of samples 
derived from ER", sporadic individuals if said breast cancer patient is ER", sporadic; the 
average level of expression of each of said plurality of genes in a plurality of samples derived 
from ER+, ER/AGE high individuals, if said breast cancer patient is ER+, ER/AGE high; the 
average level of expression of each of said plurality of genes in a plurality of samples derived 
from ER+, ER/AGE low, LN+ individuals where said breast cancer patient is ER+, ER/AGE 
low, LN+; or the average level of expression of each of said plurality of genes in a plurality 
of samples derived from ER+, ER/AGE low, LIST individuals where said breast cancer patient 
is ER+, ER/AGE low, LN~. In a more specific embodiment, each of said individuals has a 
poor prognosis. In another more specific embodiment, each of said individuals has a good 
prognosis. In an even more specific embodiment, said classifying step (c) is carried out by a 
method comprising comparing the level of expression of each of said plurality of genes or 
their encoded proteins in a sample from said breast cancer patient to said control, and 
classifying said breast cancer patient as having a poor prognosis if said level of expression 
correlates with said average level of expression of the corresponding genes or their encoded 
proteins in said control more strongly than would be expected by chance. In another specific 
embodiment, said predetermined value of ER is calculated as ER = 0.1 (AGE - 42.5), wherein 
AGE is the age of said individual. In another specific embodiment, said individual is ER", 
BRCA1, and said plurality of genes comprises two of the genes for which markers are listed 
in Table 1. In another specific embodiment, said individual is ER~, BRCA1, and said 
plurality of genes comprises all of the genes for which markers are listed in Table 1 . In 
another specific embodiment, said individual is ER~, sporadic, and said plurality of genes 
comprises two of the genes for which markers are listed in Table 2. said individual is ER", 
sporadic, and said plurality of genes comprises all of the genes for which markers are listed in 
Table 2. In another specific embodiment, said individual is ER+, ER/AGE high, and said 
plurality of genes comprises two of the genes for which markers are listed in Table 3. said 
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individual is ER+, ER/AGE high, and said plurality of genes comprises all of the genes for 
which markers are listed in Table 3. In another specific embodiment, said individual is ER+, 
ER/AGE low, LN+, and said plurality of genes comprises two of the genes for which markers 
are listed in Table 4. In another specific embodiment, said individual is ER+, ER/AGE low, 
LN+, and said plurality of genes comprises all of the genes for which markers are listed in 
Table 4. In another specific embodiment, said individual is ER+, ER/AGE low, LIST, and 
said plurality of genes comprises two of the genes for which markers are listed in Table 4. In 
another specific embodiment, said individual is ER+, ER/AGE low, LN~, and said plurality of 
genes comprises all of the genes for which markers are listed in Table 4. In another specific 
embodiment, the method further comprises determining in said cell sample the level of 
expression, relative to a control, of a second plurality of genes for which markers are not 
found in Tables 1-5, wherein said second plurality of genes is informative for prognosis. 
[0023] In another embodiment, the invention provides a method for assigning an individual 
to one of a plurality of categories in a clinical trial, comprising: (a) classifying said individual 
as ER~, BRCA1, ER", sporadic; ER+, ER/AGE high; ER+, ER/AGE low, LN+; or ER+, 
ER/AGE low, LIST; (b) determining for said individual the level of expression of at least two 
genes for which markers are listed in Table 1 if said individual is classified as ER", BRCA1; 
Table 2 if said individual is classified as ER", sporadic; Table 3 if said individual is classified 
as ER+, ER/AGE high; Table 4 if said individual is classified as ER+, ER/AGE low, LN+; or 
Table 5 if said individual is classified as ER+, ER/AGE low, LN~; (c) determining whether 
said individual has a pattern of expression of said at least two genes that correlates with a 
good prognosis or a poor prognosis; and (d) assigning said individual to one category in a 
clinical trial if said individual has a good prognosis, and assigning said individual to a second 
category in said clinical trial if said individual has a poor prognosis. In a specific 
embodiment, said individual is additionally assigned to a category in said clinical trial on the 
basis of the classification of said individual as determined in step (a). In another specific 
embodiment, said individual is additionally assigned to a category in said clinical trial on the 
basis of any other clinical, phenotypic or genotypic characteristic of breast cancer. In another 
specific embodiment, said method further comprises determining in said cell sample the level 
of expression, relative to a control, of a second plurality of genes for which markers are not 
found in Tables 1-5, wherein said second plurality of genes is informative for prognosis of 
breast cancer, and determining from the expression of said second plurality of genes, in 
addition to said first plurality of genes, whether said individual has a good prognosis or a 
poor prognosis. 
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[0024] The invention further provides a microarray comprising probes complementary and 
hybridizable to a plurality of the genes for which markers are listed in any of Tables 1-5. The 
invention further provides a microarray comprising probes complementary and hybridizable 
to a plurality of the genes for which markers are listed in Table 1, each of the genes for which 
markers are listed in Table 1, a plurality of the genes for which markers are listed in Table 2, 
each of the genes for which markers are listed in Table 2, a plurality of the genes for which 
markers are listed in Table 3, each of the genes for which markers are listed in Table 3, a 
plurality of the genes for which markers are listed in Table 4, each of the genes for which 
markers are listed in Table 4, a plurality of the genes for which markers are listed in Table 5, 
or each of the genes for which markers are listed in Table 5. The invention further provides 
any one of the above microarrays, wherein said probes are at least 50% of the probes on said 
microarray. The invention further provides any one of the above microarrays, wherein said 
probes are at least 90% of the probes on said microarray. The invention further provides 
microarray comprising probes complementary and hybridizable to a plurality of the genes for 
which markers are listed in any of Tables 1-5, wherein said probes are complementary and 
hybridizable to at least 75% of the genes for which markers are listed in Table 1; are 
complementary and hybridizable to at least 75% of the genes for which markers are listed in 
Table 2; are complementary and hybridizable to at least 75% of the genes for which markers 
are listed in Table 3; are complementary and hybridizable to at least 75% of the genes for 
which markers are listed in Table 4; and are complementary and hybridizable to at least 75% 
of the genes for which markers are listed in Table 5, wherein said probes, in total, are at least 
50% of the probes on said microarray. 

[0025] The invention further comprises a kit comprising any one of the above microarrays 
in a sealed container. 

[0026] The invention further provides a method of identifying a set of genes informative for 
a condition, said condition having a plurality of phenotypic or genotypic characteristics such 
that samples may be categorized by at least one of said phenotypic or genotypic 
characteristics into at least one characteristic class, said method comprising: (a) selecting a 
plurality of samples from individuals having said condition; (b) identifying a first set of genes 
informative for said characteristic class using said plurality of samples; (c) predicting the 
characteristic class of each of said plurality of samples; (d) discarding samples for which said 
characteristic class is incorrectly predicted; (e) repeating steps (c) and (d) at least once; and 
(f) identifying a second set of genes informative for said characteristic class using samples in 
said plurality of samples remaining after step (e). 
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[0027] The invention further provides a method for assigning an individual to one of a 
plurality of categories in a clinical trial, comprising: (a) classifying the individual into one of 
a plurality of condition categories differentiated by at least one genotypic or phenotypic 
characteristic of the condition; (b) determining the level of expression, in a sample derived 
from said individual, of a plurality of genes informative for said condition category; (c) 
determining whether said level of expression of said plurality of genes indicates that the 
individual has a good prognosis or a poor prognosis; and (d) assigning the individual to a 
category in a clinical trial on the basis of prognosis. 

[0028] The invention also provides a method for identifying one or more sets of informative 
genes or markers for a condition in an organism, comprising: (a) subdividing a plurality of 
individuals or samples derived therefrom of the organism subject to the condition into a 
plurality of classes based on one or more clinical, phenotypic or genotypic characteristics of 
the organism, wherein each class consists of a plurality of individuals or samples derived 
therefrom of the organism each of which having one or more clinical, phenotypic or 
genotypic characteristics specific for the class; and (b) attempting to identify for each of one 
or more of said plurality of classes a set of genes or markers informative for said condition in 
individuals in said class, wherein, if a set of genes or markers informative for said condition 
in individuals in said class is obtained for any of said one or more of said plurality of classes, 
said set of genes or markers is taken as a set of informative genes or markers for said 
condition in said organism. 

[0029] In one embodiment, the method further comprises, for each of one or more of said 
classes in which a set of genes or markers informative for said condition in individuals in said 
class cannot be obtained, repeating said steps (a) and (b) on said plurality of individuals or 
samples derived therefrom in said class such that said plurality of individuals or samples 
derived therefrom in said class is subdivided into a plurality of additional classes based on 
one or more clinical, phenotypic or genotypic characteristics of said organism which are 
different from those used for defining said class, wherein for each of said plurality of 
additional classes, if a set of genes or markers informative for said condition in individuals in 
said class is obtained, said set of genes or markers is taken as a set of informative genes or 
markers for said condition in said organism. 

[0030] The invention also provides a method for identifying one or more sets of informative 
genes or markers for a condition in an organism, comprising: (a) subdividing a plurality of 
individuals or samples derived therefrom of said organism subject to said condition into a 
plurality of classes based on one or more clinical, phenotypic or genotypic characteristics of 
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said organism, wherein each said class consists of a plurality of individuals or samples 
derived therefrom of said organism each having said one or more clinical, phenotypic or 
genotypic characteristics specific for said class; (b) attempting to identify for each of one or 
more of said plurality of classes a set of genes or markers informative for said condition in 
individuals in said class, wherein if a set of genes or markers informative for said condition in 
individuals in said class is identified for any of said one or more of said classes, said set of 
genes or markers is taken as a set of informative genes or markers for a condition in said 
organism; and (c) for each of one or more of said classes in which a set of genes or markers 
informative for said condition in individuals in said class cannot be obtained, repeating said 
steps (a) and (b) on said plurality of individuals or samples derived therefrom in said class 
such that said plurality of samples or individuals in said class is subdivided into a plurality of 
additional classes based on one or more clinical, phenotypic or genotypic characteristics of 
said organism which are different from those used those used for defining said class, wherein 
for each of one or more of said plurality of additional classes, if a set of genes or markers 
informative for said condition in individuals in said class is obtained, said set of genes or 
markers is taken as a set of informative genes or markers for a condition in said organism. 
[003 1] In the methods of the invention, the condition can be a type of cancer. In such an 
embodiment, each of said sets of genes or markers can be informative of prognosis of 
individuals in a corresponding class. In one embodiment, the condition is breast cancer, and 
the one or more clinical, phenotypic or genotypic characteristics comprise age, ER level, 
ER/AGE, BRAC1 status, and lymph node status. 

[0032] In one embodiment, the methods of the invention further comprise generating a 
template profile comprising measurements of levels of genes or markers of the set of 
informative genes or markers for said class representative of levels of the genes or markers in 
a plurality of patients having a chosen prognosis level. 

[0033] The invention also provides a method for predicting a breast cancer patient as having 
a good prognosis or a poor prognosis, comprising: (a) classifying said breast cancer patient 
into one of the following classes: (al) ER~ BRCAI; (a2) ER~ sporadic; (a3) ER+, ER/AGE 
high; (a4) ER+, ER/AGE low, LN+; or (a5) ER+, ER/AGE low, LN~; (b) determining a 
profile comprising measurements of a plurality of genes or markers in a cell sample taken 
from said breast cancer patient, said plurality of genes markers comprising at least two of the 
genes or markers corresponding to the markers in (bl) Table 1 if said breast cancer patient is 
classified as ER", BRCAI; (b2) Table 2 if said breast cancer patient is classified as ER~ 
sporadic; (b3) Table 3 if said breast cancer patient is classified as ER+, ER/AGE high; (b4) 
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Table 4 if said breast cancer patient is classified as ER+, ER/AGE low, LN+; or (b5) Table 5 
if said breast cancer patient is classified as ER+, ER/AGE low, LN~; and (c) classifying said 
breast cancer patient as having a good prognosis or a poor prognosis based on said profile of 
said plurality of genes or markers, wherein ER + designates a high ER level and ER~ 
designates a low ER level, wherein said ER/AGE is a metric of said ER level relative to the 
age of said patient, and wherein LN 4 " designates a greater than 0 lymph nodes status in said 
patient and LN" designates a 0 lymph nodes status in said patient. 

[0034] In one embodiment, step (c) is carried out by a method comprising comparing said 
profile to a good prognosis template and/or a poor prognosis template, and wherein said 
patient is classified as having a good prognosis if said profile has a high similarity to a good 
prognosis template or has a low similarity to a poor prognosis template or as having a poor 
prognosis if said profile has a low similarity to a good prognosis template or has a high 
similarity to a poor prognosis template. A good prognosis template comprises measurements 
of said plurality of genes or markers representative of levels of said genes or markers in a 
plurality of good outcome patients, while a poor prognosis template comprises measurements 
of said plurality of genes or markers representative of levels of said genes or markers in a 
plurality of poor outcome patients. Here a good outcome patient is a breast cancer patient 
who has non-reoccurrence of metastases within a first period of time after initial diagnosis, 
while a poor outcome patient is a patient who has reoccurrence of metastases within a second 
period of time after initial diagnosis. 

[0035] In another embodiment, the methods for predicting the prognosis of a breast cancer 
patient further comprise determining said profile, said ER level, said LN status, and/or, said 
ER/AGE. In one embodiment, said profile is an expression profile comprising measurements 
of a plurality of transcripts in a sample derived from said patient, wherein said good 
prognosis template comprises measurements of said plurality of transcripts representative of 
expression levels of said transcripts in said plurality of good outcome patients, and wherein 
said poor prognosis template comprises measurements of said plurality of transcripts 
representative of expression levels of said transcripts in said plurality of poor outcome 
patients. 

[0036] In one embodiment, said expression profile is a differential expression profile 
comprising differential measurements of said plurality of transcripts in said sample derived 
from said patient versus measurements of said plurality of transcripts in a control sample. 



- 13- 



WO 2005/086891 



PCT/US2005/007894 



[0037] In one embodiment, the measurement of each said transcript in said good prognosis 
template is an average of expression levels of said transcript in said plurality of good 
outcome patients. 

[0038] In one embodiment, the similarity of said expression profile to said good or poor 
prognosis template is represented by a correlation coefficient between said expression profile 
and said good or poor prognosis template, respectively, and a correlation coefficient greater 
than a correlation threshold, e.g., 0.5, indicates a high similarity and said correlation 
coefficient equal to or less than said correlation threshold indicates a low similarity. 
[0039] In another embodiment, the similarity of said expression profile to said good or poor 
prognosis template is represented by a distance between said cellular constituent profile and 
said good or poor prognosis template, respectively, and a distance less than a given value 
indicates a high similarity and said distance equal to or greater than said given value indicates 
a low similarity. 

[0040] In another embodiment, said profile comprises measurements of a plurality of 
protein species in a sample derived from said patient, wherein said good prognosis template 
comprises measurements of said plurality of protein species representative of levels of said 
protein species in said plurality of good outcome patients, and wherein said poor prognosis 
template comprises measurements of said plurality of protein species representative of levels 
of said protein species in said plurality of poor outcome patients. 

[0041] In one embodiment, said ER level is determined by measuring an expression level of 
a gene encoding said estrogen receptor, e.g., the estrogen receptor a gene, in said patient 
relative to expression level of said gene in said control sample, and said ER level is classified 
as ER + if loglO(ratio) of said expression level is greater than -0.65, and said ER level is 
classified as ER" if loglO(ratio) of said expression level is equal to or less than -0.65. 
[0042] In one embodiment, said ER/AGE is classified as high if said ER level is greater 
than c-(AGE - d), and said ER/AGE is classified as low if said ER level is equal to or less 
than c{ AGE - d), wherein c is a coefficient, AGE is the age of said patient, and d is an age 
threshold. 

[0043] In a specific embodiment, said estrogen receptor level is measured by a 
polynucleotide probe that detects a transcript corresponding to the gene having accession 
number NM_000125, said control sample is a pool of breast cancer cells of different patients, 
and c = 0.1 and d= 42.5. 
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[0044] In one embodiment, said control sample is generated by pooling together cDNAs of 
said plurality of transcripts from a plurality of breast cancer patients. In another embodiment, 
said control sample is generated by pooling together synthesized cDNAs of said plurality of 
transcripts and said transcript of said gene encoding said estrogen receptor. 
[0045] In one embodiment, said individual is ERT, BRCA1, and said plurality of genes 
comprises at least two of the genes for which markers are listed in Table 1 . In one 
embodiment, said individual is ERT, BRCA1, and said plurality of genes comprises all of the 
genes for which markers are listed in Table 1 . 

[0046] In another embodiment, the individual is ER", sporadic, and said plurality of genes 
comprises at least two of the genes for which markers are listed in Table 2. In one 
embodiment, said individual is ER", sporadic, and said plurality of genes comprises all of the 
genes for which markers are listed in Table 2. 

[0047] In still another embodiment, said individual is ER+, ER/AGE high, and said plurality 
of genes comprises at least two of the genes for which markers are listed in Table 3. In one 
embodiment, said individual is ER+, ER/AGE high, and said plurality of genes comprises all 
of the genes for which markers are listed in Table 3. 

[0048] In still another embodiment, said individual is ER+, ER/AGE low, LN+, and said 
plurality of genes comprises at least two of the genes for which markers are listed in Table 4. 
In one embodiment, said individual is ER+, ER/AGE low, LN+, and said plurality of genes 
comprises all of the genes for which markers are listed in Table 4. 

[0049] In still another embodiment, said individual is ER+, ER/AGE low, LTST, and said 
plurality of genes comprises at least two of the genes for which markers are listed in Table 4. 
In one embodiment, the individual is ER+, ER/AGE low, LN~, and said plurality of genes 
comprises all of the genes for which markers are listed in Table 4. 

[0050] In one embodiment, said profile further comprises one or more genes for which 
markers are not found in Tables 1-5, which are informative for prognosis. 
[005 1] The invention also provides a method for assigning an individual to one of a 
plurality of categories in a clinical trial, comprising assigning said individual to one category 
in a clinical trial if said individual has a good prognosis as determined by any one of the 
methods described above, and assigning said individual to a second category in said clinical 
trial if said individual has a poor prognosis as determined by any one of the methods 
described above. 
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[0052] In one embodiment, said individual is additionally assigned to a category in said 
clinical trial on the basis of the classification of said individual based on said profile, said ER 
level, said LN status, and/or, said ER/AGE. 

[0053] In one embodiment, said individual is additionally assigned to a category in said 
clinical trial on the basis of one or more other clinical, phenotypic or genotypic characteristic 
of breast cancer. 

[0054] In one embodiment, the method further comprises determining in said cell sample 
the levels of expression of said one or more genes for which markers are not found in Tables 
1-5, and determining from said expression levels of said one or more genes, whether said 
individual has a good prognosis or a poor prognosis. 

4. BRIEF DESCRIPTION OF THE DRAWINGS 

[0055] FIG. 1 depicts the decision tree that resulted in the five patient subsets used to 
identify informative prognosis-related genes. 

[0056] FIG. 2: Relationship between ER level and age. (A) Scatter plot of ER vs. age for 
ER+ patients. Black dots indicate metastases free samples, and gray dots indicate metastases 
samples. It appears that patients of ER+ group can be subdivided into "ER+, ER/AGE high" 
group (above the black line) and "ER-h, ER/AGE low" (below the black line) group. The 
black line is approximated by ER = 0. 1 *(AGE-42.5), and the dashed line by ER = 0. 1 *(age- 
50). Within each population, the ER level also increases with age. (B) Age distribution of all 
patients in ER+ samples. A bimodal distribution is observed. (C) ER-modulated age (age - 
10*) distribution of all patients in ER+ samples. A bimodal distribution is observed. (D) 
Age distribution of samples with metastasis. (E) ER-modulated age distribution of samples 
with metastasis. The three peaks appearing in this distribution suggest a polymorphism. 
[0057] FIG. 3. Performance of classifier for the "ERVsporadic" group. (A) Error rate 
obtained from leave-one-out cross validation (LOOCV) for predicting the disease outcome as 
a function of the number of reporter genes used in the classifier. (B) Scatter plot between 
correlation to good group (X axis) and to poor group (Y axis). Circles indicate metastases- 
free samples, squares indicate samples with metastases. Dashed line: threshold for separating 
poor from good. (C) Error rate calculated with respect to good outcome group (good 
outcome misclassified as poor divided by total number of good), or poor outcome group 
(poor outcome misclassified as good divided by total number of poor), or the average of the 
two rates. 
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[0058] FIG. 4. Performance of classifier for the "ER+, ER/AGE high" group. (A) Error 
rate obtained from leave-one-out cross validation (LOOCV) for predicting the disease 
•outcome as a function of the number of reporter genes used in the classifier. (B) Scatter plot 
between correlation to good group (X axis) and to poor group (Y axis). Circles indicate 
metastases-free samples, and squares indicate samples with metastases. Dashed line: 
threshold for separating poor from good. (C) Error rate calculated with respect to good 
outcome group (good outcome misclassified as poor divided by total number of good), or 
poor outcome group (poor outcome misclassified as good divided by total number of poor), 
or the average of the two rates. 

[0059] FIG. 5. Performance of classifier for the "ER+, ER/AGE low/LN~" group. (A) 
Error rate obtained from leave-one-out cross validation (LOOCV) for predicting the disease 
outcome as a function of the number of reporter genes used in the classifier. (B) Scatter plot 
between correlation to good group (X axis) and to poor group (Y axis). Circles indicate 
metastases-free samples, and squares indicates samples with metastases. Dashed line 
indicates the threshold for separating poor from good. (C) Error rate calculated with respect 
to good outcome group (good outcome misclassified as poor divided by total number of 
good), or poor outcome group (poor outcome misclassified as good divided by total number 
of poor), or the average of the two rates. 

[0060] FIG. 6. Performance of classifier for the "ER+, ER/AGE low/LN+" group. (A) 
Error rate obtained from leave-one-out cross validation (LOOCV) for predicting the disease 
outcome as a function of the number of reporter genes used in the classifier. (B) Scatter plot 
between correlation to good group (X axis) and to poor group (Y axis). Circles indicate 
metastases free samples, squares indicate samples with metastases. Dashed line: threshold 
for separating poor from good. (C) Error rate calculated with respect to good outcome group 
(good outcome misclassified as poor divided by total number of good), or poor outcome 
group (poor outcome misclassified as good divided by total number of poor), or the average 
of the two rates. 

[0061] FIG. 7. Performance of classifier for the "ER~, BRCAV group. (A) Error rate 
obtained from leave-one-out cross validation (LOOCV) for predicting the disease outcome as 
a function of the number of reporter genes used in the classifier. (B) Scatter plot between 
correlation to good group (X axis) and to poor group (Y axis). Circles indicate metastases 
free samples, squares indicate samples with metastases. Dashed line: threshold for separating 
poor from good. (C) Error rate calculated with respect to good outcome group (good 
outcome misclassified as poor divided by total number of good), or poor outcome group 
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(poor outcome misclassified as good divided by total number of poor), or the average of the 
two rates. 

[0062] FIG. 8. Heatmaps of genes representing key biological functions in subgroups of 
patients: A: Cell cycle genes are predictive of outcome in patients with ER/age high. B: Cell 
cycle genes are not predictive of outcome in "ER- and sporadic" patients C: Glycolysis genes 
are predictive of outcome in patients with ER/age low and LN-. D: Glycolysis genes are not 
predictive of outcome in 'ER- & BRCA1" patients. 

5. DETAILED DESCRIPTION OF THE INVENTION 

5.1 INTRODUCTION 
[0063] The present invention provides methods for classifying individuals having a 
condition, such as a disease, into one or more subsets of individuals, where individuals in 
each subset are characterized by one or more phenotypic or genotypic characteristics of the 
condition. The individuals may be eukaryotes or prokaryotes, may be animals such as 
mammals, including but not limited to humans, primates, rodents, felines, canines, etc., birds, 
reptiles, fish, etc. "Individuals" as used herein also encompasses single-celled organisms, or 
colonies thereof, such as bacteria and yeast. The condition may be a disease, such as cancer, 
and may be a specific cancer, such as breast cancer. The condition may also be an 
environmental condition, such as exposure to a toxin, pollutant, drug, proximity to urban or 
industrial areas, etc. 

[0064] The present invention provides methods of determining the prognosis of individuals 
having a condition, such as cancer, for example, breast cancer, or who are suspected of 
having the condition, by the use of a combination of clinical, biological or biochemical 
parameters of the condition and gene expression pattern data. For prognosis, the parameters 
selected preferably relate to or affect the progression and/or outcome of the condition. The 
pattern of gene expression within a subset of individuals having the particular condition leads 
to the identification of sets of genes within a subset that is informative for that subset, for 
example, for prognosis within that subset. In general, the successful identification of sets of 
genes informative for prognosis within a particular subset justifies the selection of the 
plurality of clinical, biological or biochemical parameters of the condition on which division 
of individuals into condition subsets is based. 

[0065] In the example of breast cancer, patient groups are first classified according to at 
least one of age, lymph node (LN) status, estrogen receptor (ER) level, and BRCA1 mutation 
status into discrete patient subsets. These clinical factors have been implicated in tumor 
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etiology as well as differences in disease outcome. These characteristics are not limiting; 
other genotypic or phenotypic characteristics of breast cancer, for example, tumor grade, 
tumor size, tumor cell type, etc., may also be used, alone or in combination with those listed 
herein, in order to classify individuals. The differences in gene expression or in tumor fate 
related to these parameters likely represent differences in tumor origin and tumor genesis, and 
are therefore good candidates for tumor stratification. Genesets informative for prognosis 
within each subset are then identified. New breast cancer patients are then classified using 
the same criteria, and a prognosis is made based on the geneset specific for the patient subset 
into which the patient falls. In the process of constructing a prognosis classifier within each 
patient subset, particular attention is paid to the homogeneous patterns related to the tumor 
outcome. Emergence of such homogeneous prognosis patterns may indicate the most 
common mechanism to metastasis within a subset. At the same time, successful 
identification of such patterns also justifies the parameters being used for the tumor 
stratification. To differentiate this approach from an mRNA-alone approach, the current 
approach of integrating clinical data with the gene expression data is referred to herein as a 
"comprehensive prognosis". 

5.2 DEFINITIONS 
[0066] As used herein, "BRCA1 tumor" or "BRCA1 type" means a tumor having cells 
containing a mutation of the BRCA1 locus. 

[0067] The "absolute amplitude" of correlation means the absolute value of the correlation; 
e.g., both correlation coefficients -0.35 and 0.35 have an absolute amplitude of 0.35. 
[0068] "Marker" means a cellular constituent, or a modification of a cellular constituent 
{e.g., an entire gene, EST derived from that gene, a protein encoded by that gene, post- 
translational modification of the protein, etc.) the expression or level of which changes 
between certain conditions. Where a change in a characteristic of the constituent correlates 
with a certain condition, the constituent is a marker for that condition. 
[0069] "Marker-derived polynucleotides" means the RNA transcribed from a marker gene, 
any cDNA or cRNA produced therefrom, and any nucleic acid derived therefrom, such as 
synthetic nucleic acid having a sequence derived from the gene corresponding to the marker 
gene. 

[0070] A "similarity value" is a number that represents the degree of similarity between two 
things being compared. For example, a similarity value may be a number that indicates the 
overall similarity between a patient's expression profile of specific phenotype-related 
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markers and a template specific to that phenotype (for instance, the similarity to a "good 
prognosis" template, where the phenotype is a good prognosis). The similarity value may be 
expressed as a similarity metric, such as a correlation coefficient, or may simply be expressed 
as the expression level difference, or the aggregate of the expression level differences, 
between a patient sample and a template. 

[0071] A "patient subset" is a group of individuals, all of whom have a particular condition, 
or are subject to a particular condition, which is distinguished from other individuals having 
that condition by one or more phenotypic, genotypic or clinical characteristics of the 
condition, or of a response to the condition. For example, where the condition is breast 
cancer, individuals may belong to an "ER + " or an "ER~" patient subset, or may belong to a 
particular age group patient subset. 

[0072] A gene and/or marker is "informative" for a condition, phenotype, genotype or 
clinical characteristic if the expression of the gene or marker is correlated or anticorrelated 
with the condition, phenotype, genotype or clinical characteristic to a greater degree than 
would be expected by chance. 

[0073] An individual of a given age can be classified as "ER/AGE high" if the individual's 
ER level is higher than a threshold value for the given age. The threshold can be age- 
dependent, i.e., a different threshold for each different age. In one embodiment, the age- 
dependent threshold value is calculated as c • (AGE - d), where c is a coefficient, AGE is the 
age of the patient, and d is an age threshold. The parameters c and d depend on the ER level 
and AGE used. They can be determined by fitting patients' ER level-age distribution to a 
bimodal distribution of two subgroups each having a different ER level-age dependence. In a 
specific embodiment, c = 0.1 and d = 42.5 is used for ER levels represented by a log(ratio) of 
ER expression level. Thus, for example, the threshold for a 45-year old individual in this 
embodiment is 0.1 (45-42.5), or 0.25, and if the log(ratio) of ER expression level of the 
individual is equal to or greater than 0.25, the individual is classified as "ER/AGE high"; 
otherwise, the individual is classified as "ER/AGE low." 

5.3 IDENTIFICATION OF DIAGNOSTIC AND PROGNOSTIC MARKER 
SETS 

5.3.1 IDENTIFICATION OF CONDITION SUBSETS 
[0074] The present invention provides methods of identifying sets of genes and/or markers 
useful in the diagnosis and prognosis of breast cancer. More generally, the invention also 
provides methods of identifying sets of genes and/or markers useful in the diagnosis or 
prognosis of other cancers, and even more generally, of identifying sets of genes and/or 
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markers useful in the differentiation between subgroups of individuals having a particular 
condition, such as a disease or exposure to a particular environmental condition. 
[0075] The method may be applied to any condition for which a plurality of phenotypic or 
genotypic subsets may be identified. The condition may be a disease; for example, the 
condition may be cancer, an autoimmune disease, an inflammatory disease, an infectious 
disease, a neurological disease, a degenerative disease, etc. The condition may be 
environmental; for example, the condition may be a particular diet, geographic location, etc.; 
the condition may be exposure to a compound, including, for example, a drug, a toxin, a 
carcinogen, a foodstuff, a poison, an inhaled compound, an ingested compound, etc.; the 
condition may be a particular genetic background or predisposition to a medical condition; 
etc. 

[0076] Where the condition is cancer, the condition may be any cancer, for example, 
without limitation: leukemias, including acute leukemia, acute lymphocytic leukemia, acute 
myelocytic leukemia, myeloblasts leukemia, promyelocytic leukemia, myelomonocytic 
leukemia, monocytic leukemia, and erythroleukemia; chronic leukemia, such as chronic 
myelocytic (granulocytic) leukemia or chronic lymphocytic leukemia; polycythemia vera; 
lymphomas, such as Hodgkin's disease and non-Hodgkin's disease; multiple myeloma; 
Waldenstrom's macroglobulinemia; heavy chain disease; solid tumors, such as sarcomas and 
carcinomas, fibrosarcoma, myxosarcoma, liposarcoma, chondrosarcoma, osteogenic sarcoma, 
chordoma, angiosarcoma, endothelio sarcoma, lymphangiosarcoma, 

lymphangioendotheliosarcoma, synovioma, mesothelioma, Ewing's tumor, leiomyosarcoma, 
rhabdomyosarcoma, colon carcinoma, pancreatic cancer, breast cancer, ovarian cancer, 
prostate cancer, squamous cell carcinoma, basal cell carcinoma, adenocarcinoma, sweat gland 
carcinoma, sebaceous gland carcinoma, papillary carcinoma, papillary adenocarcinomas, 
cystadenocarcinoma, medullary carcinoma, bronchogenic carcinoma, renal cell carcinoma, 
hepatoma, bile duct carcinoma, choriocarcinoma, seminoma, embryonal carcinoma, Wilms' 
tumor, cervical cancer, testicular tumor, lung carcinoma, small cell lung carcinoma, bladder 
carcinoma, epithelial carcinoma, glioma, astrocytoma, medulloblastoma, craniopharyngioma, 
ependymoma, pinealoma, hemangioblastoma, acoustic neuroma, oligodendroglioma, 
menangioma, melanoma, neuroblastoma, or retinoblastoma; etc. 

[0077] Rather than stratifying individuals, such as patients or tumor samples derived from 
patients, by gene expression patterns in the first instance, the method of identifying sets of 
genes informative for a condition begins by identifying phenotypic, genotypic or clinical 
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subsets of individuals within the larger class of individuals having or affected by the 
condition. 

[0078] In one embodiment, the condition is cancer, and the subsets are distinguished by 
phenotypic, genotypic, and/or clinical characteristics of the cancer. In this embodiment, 
groups of individuals are classified according to one or more phenotypic, genotypic, or 
clinical characteristics relevant to the cancer into patient subsets. At any step in the process 
of subdividing a patient population into patient subsets, the expression level of one or more 
genes may be determined in order to identify whether a prognosis-informative set of genes 
may be identified for the particular patient subset. If an informative gene set is identified, but 
is not as informative as desired, the patient subset may be further divided and a new geneset 
identified. These subsets may be further subdivided. For example, a group of individuals 
affected by a particular cancer may be classified first on the basis of a phenotypic, genotypic 
or clinical characteristic A into subsets SI and S2. The levels of expression of a plurality of 
genes are then determined in tumor samples taken from individuals that fall within subsets SI 
or S2 in order to identify sets of genes informative for prognosis within these subsets. 
Subsets SI and S2 may then each be subdivided into two or more subsets based on other 
phenotypic, genotypic or clinical characteristics. The basis for subdivision, if performed, 
need not be the same for SI and S2. For example, in various embodiments, SI is not 
subdivided, while S2 is subdivided on the basis of characteristic B; or SI is subdivided based 
on characteristic B while S2 is not subdivided; or SI and S2 are both subdivided on the basis 
of characteristic B; or SI is subdivided based on characteristic B, while S2 is subdivided 
according to characteristic C; and so on. For a particular decision matrix leading to a 
plurality of patient subsets, the preferred outcome is a prognosis-informative set of genes for 
each patient subset. Different decision matrices may lead to different patient subsets, which, 
in turn, may result in different sets of prognosis-informative genes. 
[0079] In the specific example of breast cancer, a plurality of phenotypic, genotypic or 
clinical indications are used to classify a patient as being a member of one of a plurality of 
patient subsets, wherein the indications are medically, biochemically or genetically relevant 
to breast cancer. For example, a group of patients may be classified into patient subsets 
based on criteria including, but not limited to, estrogen receptor (ER) status, type of tumor 
(i.e., BRCAlAype or sporadic), lymph node status, grade of cancer, invasiveness of the tumor, 
or age. "BRCAl-type" indicates that the BRCA1 mutation is present. In each classification 
step, a group of cancer patients may be classified into only two classes, for example, ER+ or 
ER"~, or into three or more subsets (for example, by tumor grade), depending upon the 
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characteristic used to determine the subsets. As used herein, "ER+" indicates that the 
estrogen receptor is expressed at some elevated level; for example, it may indicate that the 
estrogen receptor is detectably expressed, or may indicate that more than 10% of cells are 
histologically stained for the receptor, etc. Conversely, "ER-" indicates that the estrogen 
receptor is expressed at a reduced level or not at all; for example, it may indicate that the 
receptor is not detectably expressed, or that 10% or less of cells are histologically stained for 
the receptor, etc. Marker gene sets optimized for each phenotypic class are preferably 
determined after the subsets are established. Where informative markers for a particular 
patient subset, distinguished from another subset by a particular characteristic of the 
condition of interest, cannot be determined, the subset may be further divided by another 
characteristic of the condition to create a plurality of second patient subsets, whereupon genes 
informative for these second patient subsets may be identified. 

[0080] FIG. 1 depicts the process, described in the Examples, of subdivision of a collection 
of breast cancer patients according to phenotypic and genotypic characteristics relevant to 
breast cancer, in preparation for identification of genes informative for prognosis. A 
collection of breast cancer tumor samples was first subdivided by estrogen receptor status. 
ER status was chosen because the presence or absence of the estrogen receptor greatly 
influences the expression of other genes. In the ER+ patient subset, it was noted that patients 
appeared to be bimodally distributed by ER level vs. age; that is, ER level dependence upon 
age tended to fall within two classes, as separated by the solid line in FIG. 2A. This 
bimodality was used to further subdivide ER+ individuals into "ER+, ER/AGE high" 
individuals and "ER+, ER/AGE low" individuals. A set of informative genes was identified 
for the ER+, ER/AGE high patient subset. An informative set was not identified for the ER+, 
ER/AGE low subset, however, so the subset of patients was further divided into LN+ and 
LN- individuals. Thus, in one embodiment, the present invention provides a method of 
identifying a set of informative genes or markers for a condition comprising a plurality of 
phenotypic or genotypic characteristics, comprising (a) classifying each of a plurality of 
samples or individuals on the basis of one phenotypic or genotypic characteristic into a 
plurality of first classes; and (b) identifying within each of said first classes a set of 
informative genes or markers, wherein said set of informative genes or markers within each 
said first classes is unique to said class. 

5.3.2 IDENTIFICATION OF MARKER SETS INFORMATIVE FOR 
PATIENT SUBSETS 
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[0081] Once a patient subset is identified, markers, such as genes, informative for a 
particular outcome, such as prognosis, may be identified. In one embodiment, the method for 
identifying marker sets is as follows. This example describes the use of genes and gene- 
derived nucleic acids as markers; however, proteins or other cellular constituents may be used 
as markers of the condition. 

[0082] After extraction and labeling of target polynucleotides, the expression of a plurality 
of markers, such as genes, in a sample X is compared to the expression of the plurality 
markers in a standard or control. In one embodiment, the standard or control comprises 
target markers, such as polynucleotide molecules, derived from one or more samples from a 
plurality of normal individuals, or a plurality of individuals not exposed to a particular 
condition. For example, the control, or normal, individuals may be persons without the 
particular disease or condition of interest (e.g., individuals not afflicted with breast cancer, 
where breast cancer is the disease of interest), or may be an individual not exposed to a 
particular environmental condition. The standard or control may also comprise target 
polynucleotide molecules, derived from one or more samples derived from individuals having 
a different form or stage of the same disease; a different disease or different condition, or 
individuals exposed or subjected to a different condition, than the individual from which 
sample X was obtained. The control may be a sample, or set of samples, taken from the 
individual at an earlier time, for example, to assess the progression of a condition, or the 
response to a course of therapy. 

[0083] In a preferred embodiment, the standard or control is a pool of target polynucleotide 
molecules. However, where protein levels, or the levels of any other relevant biomolecule, 
are to be compared, the pool may be a pool of proteins or the relevant biomolecule. In a 
preferred embodiment in the context of breast cancer, the pool comprises samples taken from 
a number of individuals having sporadic-type tumors. 

[0084] In another preferred embodiment, the pool comprises an artificially-generated 
population of nucleic acids designed to approximate the level of nucleic acid derived from 
each marker found in a pool of marker-derived nucleic acids derived from tumor samples. In 
another embodiment, the pool, also called a "mathematical sample pool," is represented by a 
set of expression values, rather than a set of physical polynucleotides; the level of expression 
of relevant markers in a sample from an individual with a condition, such as a disease, is 
compared to values representing control levels of expression for the same markers in the 
mathematical sample pool. Such a control may be a set of values stored on a computer. Such 
artificial or mathematical controls may be constructed for any condition of interest. 
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[0085] In another embodiment specific to breast cancer, the pool is derived from normal or 
breast cancer cell lines or cell line samples. In a preferred embodiment, the pool comprises 
samples taken from individuals within a specific patient subset, e.g., "ER+, ER/AGE high" 
individuals, wherein each of said individuals has a good prognosis, or each of said individuals 
has a poor prognosis. Of course, where, for example, expressed proteins are used as markers, 
the proteins are obtained from the individual's sample, and the standard or control could be a 
pool of proteins from a number of normal individuals, or from a number of individuals 
having a particular state of a condition, such as a pool of samples from individuals having a 
particular prognosis of breast cancer. 

[0086] The comparison may be accomplished by any means known in the art. For example, 
expression levels of various markers may be assessed by separation of target polynucleotide 
molecules (e.g., RNA or cDNA) derived from the markers in agarose or polyacrylamide gels, 
followed by hybridization with marker-specific oligonucleotide probes. Alternatively, the 
comparison may be accomplished by the labeling of target polynucleotide molecules 
followed by separation on a sequencing gel. Polynucleotide samples are placed on the gel 
such that patient and control or standard polynucleotides are in adjacent lanes. Comparison 
of expression levels is accomplished visually or by means of densitometer. In a preferred 
embodiment, the expression of all markers is assessed simultaneously by hybridization to a 
microarray. In each approach, markers meeting certain criteria are identified as informative 
for the prognosis of breast cancer. 

[0087] Marker genes are selected based upon significant difference of expression in a 
condition, such as a disease, as compared to a standard or control condition. Marker genes 
may be screened, for example, by determining whether they show significant variation within 
a set of samples of interest. Genes that do not show a significant amount of variation within 
the set of samples are presumed not to be informative for the disease or condition, and are not 
selected as markers for the disease or condition. Genes showing significant variation within 
the sample set are candidate informative genes for the disease or condition. The degree of 
variation may be estimated by calculating the difference of the expression of the gene, or 
ratio of expression between sample and control, within the set of samples. The expression, or 
ratio of expressions, may be transformed by any means, e.g., linear or log transformation. 
Selection may be made based upon either significant up- or down regulation of the marker in 
the patient sample. Selection may also be made by calculation of the statistical significance 
(i.e., the p- value) of the correlation between the expression of the marker and the disease and 
condition. Preferably, both selection criteria are used. Thus, in one embodiment of the 
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present invention, markers associated with prognosis of breast cancer within a patient subset 
are selected where the markers show both more than two-fold change (increase or decrease) 
in expression as compared to a standard, and the p-value for the correlation between the 
existence of breast cancer and the change in marker expression is no more than 0.01 (i.e., is 
statistically significant). 

[0088] In the context of the present invention, "good prognosis" indicates a desired outcome 
for a particular condition, especially a particular disease, and "poor prognosis" indicates an 
undesired outcome of the condition. For example, where the condition is cancer, a "good 
prognosis" may mean partial or complete remission, and "poor prognosis" may mean 
reappearance of the cancer after treatment. What constitutes "good prognosis" and "poor 
prognosis" is specific to the condition of interest, for example, specific to the particular 
cancer an individual suffers. For example, "good prognosis" for pancreatic cancer may be 
survival for one or two years after initial diagnosis, while "good prognosis" for Hodgkin's 
disease may be survival for five years or more. In the specific example of breast cancer, 
"good prognosis" means the likelihood of non-reoccurrence of metastases within a period of 
1, 2, 3, 4, 5 or more years after initial diagnosis, and "poor prognosis" means the likelihood 
of reoccurrence of metastasis within that period. In a more specific example, "good 
prognosis" means the likelihood of non-reoccurrence of metastases within 5 years after initial 
diagnosis, and "poor prognosis" means the likelihood of reoccurrence of metastasis within 
that period. 

[0089] In a more specific embodiment for cancer, for example, breast cancer, using a 
number of breast cancer tumor samples, markers are identified by calculation of correlation 
coefficients p between the clinical category or clinical parameter(s) c and the linear, 
logarithmic or any transform of the expression ratio r across all samples for each individual 
gene. Specifically, the correlation coefficient may be calculated as: 
[0090] = r)/(\\c\\ • \\r ||) Equation (1) 

[0091] Markers for which the coefficient of correlation exceeds a cutoff are identified as 
prognosis-informative markers specific for a particular clinical type, e.g., good prognosis, 
within a given patient subset. Such a cutoff or threshold may correspond to a certain 
significance of discriminating genes obtained by Monte Carlo simulations. The threshold 
depends upon the number of samples used; the threshold can be calculated as 3 X 1 
where l/Vw-3 is the distribution width and n = the number of samples. In a specific 
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embodiment, markers are chosen if the correlation coefficient is greater than about 0.3 or less 
than about -0.3. 

[0092] Next, the significance of the correlation is calculated. This significance may be 
calculated by any statistical means by which such significance is calculated. In a specific 
example, a set of correlation data is generated using a Monte-Carlo technique to randomize 
the association between the expression difference of a particular marker and the clinical 
category. The frequency distribution of markers satisfying the criteria in the Monte-Carlo 
runs is used to determine whether the number of markers selected by correlation with clinical 
data is significant. 

[0093] Once a marker set is identified, the markers may be rank-ordered in order of 
significance of discrimination. One means of rank ordering is by the amplitude of correlation 
between the change in gene expression of the marker and the specific condition being 
discriminated. Another, preferred, means is to use a statistical metric. In a specific 
embodiment, the metric is a t-test-like statistic: 

[0094] r = (( x i)"( X2 )K— T7 — — Equation (2) 

[0095] In this equation, (x x ) is the error-weighted average of the log ratio of transcript 
expression measurements within a first clinical group {e.g., good prognosis), (x 2 ) is the error- 
weighted average of log ratio within a second, related clinical group {e.g., poor prognosis), 
<y l is the variance of the log ratio within the first clinical group {e.g., good prognosis), n\ is 
the number of samples for which valid measurements of log ratios are available, cr 2 ls the 
variance of log ratio within the second clinical group {e.g., poor prognosis), and n 2 is the 
number of samples for which valid measurements of log ratios are available. The /-value 
represents the variance-compensated difference between two means. 

[0096] The rank-ordered marker set may be used to optimize the number of markers in the 
set used for discrimination. This is accomplished generally in a "leave one out" method as 
follows. In a first run, a subset, for example five, of the markers from the top of the ranked 
list is used to generate a template, where out of X samples, X-l are used to generate the 
template, and the status of the remaining sample is predicted. This process is repeated for 
every sample until every one of the X samples is predicted once. In a second run, additional 
markers, for example five additional markers, are added, so that a template is now generated 
from 10 markers, and the outcome of the remaining sample is predicted. This process is 
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repeated until the entire set of markers is used to generate the template. For each of the runs, 
type 1 error (false negative) and type 2 errors (false positive) are counted; the optimal number 
of markers is that number where the type 1 error rate, or type 2 error rate, or preferably the 
total of type 1 and type 2 error rate is lowest. 

[0097] For prognostic markers, validation of the marker set may be accomplished by an 
additional statistic, a survival model. This statistic generates the probability of tumor distant 
metastases as a function of time since initial diagnosis. A number of models may be used, 
including Weibull, normal, log-normal, log logistic, log-exponential, or log-Rayleigh 
(Chapter 12 "Life Testing", S-PLUS 2000 Guide to Statistics, Vol. 2, p. 368 (2000)). For 
the "normal" model, the probability of distant metastases P at time / is calculated as 
[0098] P = axQxp(-t 2 /r 2 ) Equation 

(3) 

[0099] where a is fixed and equal to 1, and X is a parameter to be fitted and measures the 
"expected lifetime". 

[00100] It is preferable that the above marker identification process be iterated one or more 
times by excluding one or more samples from the marker selection or ranking {i.e., from the 
calculation of correlation). Those samples being excluded are the ones that can not be 
predicted correctly from the previous iteration. Preferably, those samples excluded from 
marker selection in this iteration process are included in the classifier performance 
evaluation, to avoid overstating the performance. 

[00101] It will be apparent to those skilled in the art that the above methods, in particular the 
statistical methods described above, are not limited to the identification of markers associated 
with the prognosis of breast cancer within a particular patient subset, but may be used to 
identify set of marker genes associated with any phenotype or condition, or with any subset 
of a phenotype or condition defined by one or more characteristics of the phenotype or 
condition. The phenotype or condition can be the presence or absence of a disease such as 
cancer, or the presence or absence of any identifying clinical condition associated with that 
cancer. In the disease context, the phenotype may be a prognosis such as a survival time, 
probability of distant metastases of a disease condition, or likelihood of a particular response 
to a therapeutic or prophylactic regimen. The phenotype need not be cancer, or a disease; the 
phenotype may be a nominal characteristic associated with a healthy individual. 
[00102] Thus, the invention provides a method of identifying a set of informative genes or 
markers for a condition comprising a plurality of phenotypic of genotypic characteristics, 
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comprising: (a) classifying each of a plurality of samples or individuals on the basis of one or 
more phenotypic or genotypic characteristics of said condition into a plurality of first classes; 
(b) identifying within each of said first classes a first set of genes or markers informative for 
said condition, wherein said first set of genes or markers within each of said first classes is 
unique to said class relative to other classes. In a specific embodiment, samples or 
individuals in at least one of said first classes are additionally classified on the basis of a 
phenotypic or genotypic characteristic different from that used to distinguish said first classes 
into a plurality of second classes, and identifying within at least one of said second classes a 
second set of informative genes or markers, wherein said second set of informative genes or 
markers within each of said second classes is unique to said second class relative to other 
classes. In another embodiment, the invention provides a method of identifying a set of 
informative genes or markers for a condition comprising a plurality of phenotypic or 
genotypic characteristics, comprising: (a) classifying each of a plurality of samples or 
individuals on the basis of one or more phenotypic or genotypic characteristics into a 
plurality of first classes; (b) classifying at least one of said first classes into a plurality of 
second classes on the basis of phenotypic or genotypic characteristic different than that used 
to distinguish said plurality of first classes; (c) identifying within at least one of said first 
classes or said second classes a set of genes or markers informative for said condition, 
wherein said set of genes or markers is unique to said class relative to other classes. The 
invention further provides a method of identifying a set of informative genes or markers for a 
condition comprising a plurality of phenotypic or genotypic characteristics, comprising: (a) 
selecting a first characteristic from said plurality of phenotypic or genotypic characteristics; 
(b) identifying at least two first condition classes differentiable by said first characteristic; (c) 
selecting a plurality of individuals classifiable into at least one of said first condition classes; 
and (d) identifying in samples derived from each of said plurality of individuals a set of genes 
or markers informative for said condition within said at least one of said first condition 
classes. 

5.3.3 CLASSIFIER GENESETS FOR FIVE PATIENT SUBSETS 
[00103] The present invention provides sets of markers useful for the prognosis of breast 
cancer. The markers were identified according to the above methods in specific subsets of 
individuals with breast cancer. Generally, the marker sets were identified within a population 
of breast cancer patients that had been first stratified into five phenotypic categories based on 
criteria relevant to breast cancer prognosis, including estrogen receptor (ER) status, lymph 
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node status, type of mutation(s) (i.e., BRCAl-type or sporadic) and age at diagnosis. More 
specifically, patients, and tumors from which samples were taken, were classified as ER~ 
sporadic (i.e., being both estrogen receptor negative and having a non~£i?C4i-type tumor); 
ER~ BRCA1 (i.e., being both estrogen receptor negative and having a BRCAl-type tumor); 
ER+ 5 ER/AGE high (i.e., estrogen receptor positive with a high ratio of the log (ratio) of 
estrogen receptor gene expression to age); ER+, ER/AGE low, LN+ (i.e., estrogen receptor 
positive with a low ratio of the log (ratio) of estrogen receptor gene expression to age, lymph 
node positive); and ER + 5 ER/AGE low, LN~ (i.e., estrogen receptor positive with a low ratio 
of the log (ratio) of estrogen receptor gene expression to age, lymph node negative). The 
rationale for subdivision of the original patient set into these five subsets is detailed in the 
Examples (Section 6). The marker sets useful for each of the subsets above are provided in 
Tables 1-5, respectively. 



Table 1: Geneset of 20 markers used to classify ER , sporadic individuals. 



Accession/ 
Contig No. 


Gene 


Avg 
good 
xdev 


Avg 
poor 
xdev 


Corre- 
lation 


Description 


Sp_xref_keyword_Hst 


SEQ ID 


AF055033 


IGFBP5 


-2.12 


0.88 


0.54 


insulin-like 
growth factor 
binding protein 5 


Growth factor binding, 
Glycoprotein, Signal, 3D- 
structure 


11 


NM_000599 


1GFBP5 


-3.41 


0.43 


0.53 


insulin-like 
growth factor 
binding protein 5 


Growth factor binding, 
Glycoprotein, Signal, 3D- 
structure 


51 


L27560 


IGFBP5 


-4.55 


0 


0.52 


EST 


Hypothetical protein 


29 


AF052162 


FLJ 12443 


-0.27 


1.6 


0.52 


EST 


Hypothetical protein 


9 


NM_001456 


FLNA 


-0.61 


2.47 


0.52 


filamin A, alpha 
(actin binding 
protein 280) 


Hypothetical protein, 
Actin-binding, 
Phosphorylation, Repeat, 
Polymorphism, Disease 
mutation 


73 


NM_002205 


1TGA5 


-0.37 


2.08 


0.49 


integrin, alpha 5 
(fibronectin 
receptor, alpha 
polypeptide) 


Integrin, Cell adhesion, 
Receptor, Glycoprotein, 
Transmembrane, Signal, 
Calcium, Repeat 


93 


NM_013261 


PPARGC1 


0.09 


1.54 


0.47 


peroxisome proliferative activated receptor, 
gamma, coactivator 1 


231 


NM_001605 


AARS 


0.39 


2.36 


0.51 


alanyl-tRNA 
synthetase 


Aminoacyl-tRNA 
synthetase, Protein 
biosynthesis, Ligase, 
ATP-binding 


77 


X87949 


HSPA5 


-0.03 


2.03 


0.49 


heat shock 
70kDa protein 5 
(glucose- 
regulated 
protein, 78kDa) 


ATP-binding, Hypothetical 
protein, Endoplasmic 
reticulum, Signal 


273 


Contig50950_RC 


NGEF 


-1.17 


3.2 


0.52 


neuronal guanine nucleotide exchange 
factor 


337 ; 
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Accession/ 
Contig No. 


Gene 


Avg 
good 
xdev 


Avg 
poor 
xdev 


Corre- 
lation 


Description 


S p_xref_key wo rd_l ist 


SEQ ID 


NM_005689 


ABCB6 


-0.51 


2.26 


0.48 


ATP-binding 
cassette, sub- 
family B 
(MDR/TAP), 
member 6 


ATP-binding, Transport, 
Transmembrane, 
Mitochondrion, Inner 
membrane, Transit 
peptide, Hypothetical 
protein 


187 


NM_004577 


PSPH 


-0.56 


3.05 


0.51 


phosphoserine 
phosphatase 


Hydrolase, Serine 
biosynthesis, Magnesium, 
Phosphorylation 


151 


NM_003832 


PSPHL 


-2.08 


2.18 


0.5 


phosphoserine 

phosphatase- 

like 




131 


NMJ302422 


MMP3 


-0.96 


2.54 


0.5 


matrix 

metalloproteinas 
e 3 (stromelysin 
1 , progelatinase) 


Hydrolase, 
Metalloprotease, 
Glycoprotein, Zinc, 
Zymogen, Calcium, 
Collagen degradation, 
Extracellular matrix, 
Signal, Polymorphism, 
ou-s tructu re 


101 


Contig37562_RC 




-3.42 


-6.02 


-0.59 


ESTs 




293 


NM_018465 


MDS030 


-0.82 


-3.28 


-0.58 


uncharacterized 
hematopoietic 
stem/progenitor 
cells protein 
MDS030 


Hypothetical protein 




Contig54661_RC 




-0.79 


-2.08 


-0.54 


ESTs 




349 


AB032969 


KIAA1143 


-0.6 


-2.85 


-0.53 


KIAA1143 
protein 


Hypothetical protein 


1 


Contig55353_RC 


KIAA1915 


-0.27 


-1.82 


-0.47 


KIAA1915 
protein 


Hypothetical protein 


353 


NMJD05213 


CSTA 


2.11 


-3.4 


-0.49 


cystatin A (stefin 
A) 


Thiol protease inhibitor, 
3D-structure 


175 



Table 2. Geneset of 10 markers used to classify ER 9 BRCA1 individuals. 



Accession/ 
Contig No. 


Gene 


Avg 

good 

xdev 


Avg 
poor 
xdev 


Correl- 
ation 


Sequence 
name 


Description 


Sp xref keywo 
rdjist 


SEQ ID 


AF005487 




6.08 


0.5 


-0.79 


HLA-DRB6 


Homo sapiens 
MHC class II 
antigen (DRB6) 
mRNA, HLA- 
DRB6*0201 allele, 
sequence. 


MHC 


3 


Contig50728_RC 
Contig53598_RC 




4.02 


0.25 


-0.77 




ESTs, Weakly similar to S26650 
DNA-binding protein 5 - human 
[H. sapiens] 


333 


8.41 


3.26 


-0.77 


FLJ11413 


hypothetical protein 
FLJ11413 


Hypothetical 
protein 


343 I 


NM_002888 


RARR 
ES1 


6.9 


0.05 


-0.87 


RARRES1 


retinoic acid 
receptor responder 
(tazarotene 
induced) 1 


Receptor, 
Transmembrane 
, Signal-anchor 


109 
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NM_005218 


DEFB1 


5.14 


-3.02 


-0.81 


DEFB1 


defensin, beta 1 


Antibiotic, 
Signal, 3D- 
structure 


177 


U 17077 


BENE 


2.72 


-1.72 


-0.77 


BENE 


BENE protein • 


Transmembrane 


271 


Contig14683_RC 




1.29 


-2.31 


-0.74 




ESTs 




279 


Contig53641_RC 




-3.29 


4.23 


0.75 


MAGE-E1 


MAGE-E1 protein 


Hypothetical 
protein 


345 


Contig56678_RC 




-6.7 


-9.73 


-0.82 




ESTs, Highly similar to 
THYAJHUMAN Prothymosin alpha 
[H. sapiens] 


357 


NMJD05461 


KRML 


0.88 


-3.38 


-0.75 


MAFB 


v-maf 

musculoaponeuroti 
c fibrosarcoma 
oncogene homolog 
B (avian) 


Transcription 

regulation, 

Repressor, 

DNA-binding, 

Nuclear protein, 

Hypothetical 

protein 


181 



Table 3. Geneset of 50 markers used to classify ER+ 5 ER/AGE high individuals. 



Accession/ 
Contig No. 


Gene 


Avg 
good 
xdev 


Avg 
poor 
xdev 


Corre- 
lation 


Description 


Sp xref keyword 
Jist 


SEQ ID 


NMJD03600 


STK15 


-2.93 


2.08 


0.8 


serine/threonine 
kinase 6 


ATP-binding, 
Kinase, 

Serine/threonine- 
protein kinase, 
Transferase 


125 


NM_003158 


STK6 


-1.57 


1.42 


0.78 


serine/threonine 
kinase 6 


ATP-binding, 
Kinase, 

Serine/threonine- 
protein kinase, 
Transferase 


113 


NMJ)07019 


UBCH10 


-2.98 


2.62 


0.81 


ubiquitin-conjugating 
enzyme E2C 


Hypothetical 
protein, Ubl 
conjugation 
pathway, Ligase, 
Multigene family, 
Mitosis, Cell cycle, 
Cell division 


217 


NM_013277 


ID-GAP 


-2.43 


2.43 


0.77 


Rac GTPase 
activating protein 1 


Hypothetical 
protein 


233 


NM_004336 


BUB1 


-2.04 


1.39 


0.77 


BUB1 budding 
uninhibited by 
benzimidazoles 1 
homolog (yeast) 


Transferase, 
Serine/threonine- 
protein kinase, 
ATP-binding, Cell 
cycle, Nuclear 
protein, Mitosis, 
Phosphorylation, 
Polymorphism 


147 
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Accession/ 
Contig No. 


Gene 


Avg 

good 

xdev 


Avg 
poor 
xdev 


Corre- 
lation 


Description 


Sp xref keyword 
Jist 


SEQ ID 


NM_006607 


PTTG2 


-1.71 


1.49 


0.72 


pituitary tumor- 
transforming 2 




211 


AK001166 


FU 11252 


-1.33 


0.99 


0.71 


hypothetical protein 
FLJ11252 


Hypothetical 
protein 


13 


NM_004701 


CCNB2 


-4.62 


2.01 


0.81 


cyclin B2 


Cyclin, Cell cycle, 
Cell division, 
Mitosis 


153 


Contig57584_RC 




-3.68 


2.04 


0.78 


likely ortholog of mouse gene rich 
cluster, C8 gene 


359 


NM_006845 


KNSL6 


-4.13 


1.05 


0.73 


kinesin-like 6 (mitotic 
centromere- 
associated kinesin) 


Hypothetical 
protein, Motor 
protein, 

Microtubules, ATP- 
binding, Coiled coil, 
Nuclear protein 


215 


Contig38901_RC 




-3.08 


1.15 


0.75 


hypothetical protein 
MGC45866 


Hypothetical f 
protein 


299 


NMJD18410 


DKFZp76 
2E1312 


-4.38 


1.49 


0.75 


hypothetical protein 
DKFZp762E1312 


Hypothetical 
protein 


263 


NM_003981 


PRC1 


-3.52 


2.17 


0.78 


protein regulator of 

ov/tnkinp^i^ 1 




133 


NMJ301809 


CENPA 


-5.04 


0.98 


0.75 


centromere protein 
A, 17kDa 


Hypothetical 
protein, 
Chromosomal 
protein, Nuclear 
protein, DNA- 
binding, 
Centromere, 
Antigen 


81 


NMJD03504 


CDC45L 


-2.67 


1.22 


0.73 


CDC45 cell division 
cycle 45-like (S. 
cerevisiae) 


DNA replication, 
Cell cycle, Nuclear 
protein, Cell 
division 


123 


Contig41413_RC 




-5.43 


2.15 


0.74 


ribonucleotide 
reductase M2 
polypeptide 


Oxidoreductase, 
DNA replication, 
Iron 


305 


NMJ304217 


STK12 


-2.17 


0.73 


0.72 


serine/threonine 
kinase 12 


Hypothetical 
protein, ATP- 
binding, Kinase, 
Serine/threonine- 
protein kinase, 
Transferase 


143 | 


NMJ)02358 


MAD2L1 


-2.65 


2.27 


0.83 


MAD2 mitotic arrest 
deficient-like 1 
(yeast) 


Cell cycle, Mitosis, 
Nuclear protein, 
3D-structure 


99 


NMJ314321 


ORC6L 


-2.73 


1.8 


0.75 


origin recognition 
complex, subunit 6 
homolog-like (yeast) 


Hypothetical 
protein, DNA 
replication, Nuclear 
protein, DNA- 
binding 


241 


NM_012291 


KIAA0165 


-1.52 


1.55 


0.71 


extra spindle poles 
like 1 (S. cerevisiae) 


Hypothetical 
protein 


229 
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Accession/ 
Contig No. 


Gene 


Avg 
good 
xdev 


Avg 
poor 
xdev 


Corre- 
lation 


Description 


Sp xref keyword 
Jist 


SEQ ID 


NMJ304203 


PKMYT1 


-3.64 


2.2 


0.7 


retinoblastoma-like 2 
(p130) 


ATP-binding, 
Kinase, 

Serine/threonine- 
protein kinase, 
Transferase, 
Transcription 
regulation, DNA- 
binding, Nuclear 
protein, Cell cycle, 
Phosphorylation, 
Anti-oncogene 


137 


M96577 


E2F1 


-2.14 


1.42 


0.75 


E2F transcription 
factor 1 


Transcription 
regulation, 
Activator, DNA- 
binding, Nuclear 
protein, 

Phosphorylation, 
Cell cycle, 
Apoptosis, 
Polymorphism 


33 


NM_002266 


KPNA2 


-3.77 


1.78 


0.71 


karyopherin alpha 2 
(RAG cohort 1, 
mportin alpha 1) 


Transport, Protein 
transport, Repeat, 
Nuclear protein, 
Polymorphism 


95 


Contig31288_RC 




-2.63 


0.7 


0.68 


ESTs, Weakly similar to hypothetical 
protein FLJ20489 [Homo sapiens] 
H. sapiens] 


289 


NM_014501 


E2-EPF 


-1.55 


1.93 


0.7 


ubiquitin carrier 
protein 


Ubl conjugation 
pathway, Ligase, 
Multigene family 


247 


NM_001168 


BIRC5 


-5.76 


2.01 


0.78 


baculoviral IAP 
repeat-containing 5 
(survivin) 


Apoptosis, Thiol 
protease inhibitor, 
Alternative splicing, 
3D-structure, 
Hypothetical 
protein, Protease, 
Receptor 


63 


NM_003258 


TK1 


-4.57 


1.38 


0.71 


thymidine kinase 1, 
soluble 


Transferase, 
Kinase, DNA 
synthesis, ATP- 
binding 


115 


NM_001254 


CDC6 


-2.46 


0.28 


0.72 


CDC6 cell division 
cycle 6 homolog (S. 
cerevisiae) 


ATP-binding, Cell 
division 


67 


NM_004900 


DJ742C19 
.2 


-2.96 


0.13 


0.69 


apolipoprotein B 
mRNA editing 
enzyme, catalytic 
polypeptide-like 3B 


Hydrolase 


161 


NM_004702 


CCNE2 


-3.12 


2.13 


0.81 


cyclin E2 


Cell cycle, Cell 
division, Cyclin, 
Hypothetical 
protein, 

Phosphorylation, 
Alternative splicing, 
Nuclear protein 


155 
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Accession/ 
Contig No. 


Gene 


Avg 
good 
xdev 


Avg 

poor 

xdev 


Corre- 
lation 


Description 


S p_x r ef _keywo rd 
_list 


SEQ ID 


AL160131 




-3.07 


2.42 


0.7 


hypothetical protein 
MGC861 


Hypothetical 
protein 


21 


NMJ316359 


LOC5120 
3 


-3.22 


2.61 


0.76 


nucleolar protein 
ANKT 


Hypothetical 
protein, Nuclear 
protein 


253 


NM_004856 


KNSL5 


-1.52 


1.1 


0.71 


kinesin-like 5 (mitotic 
kinesin-like protein 
1) 


Motor protein, Cell 
division, 

Microtubules, ATP- 
binding, Coiled coil, 
Mitosis, Cell cycle, 
Nuclear protein 


159 


NM_000057 


BLM 


-1.54 


0.76 


0.71 


Bloom syndrome 


Hydrolase, 
Helicase, ATP- 
binding, DNA- 
binding, Nuclear 
protein, DNA 
replication, 
Disease mutation 


35 


NM_018455 


BM039 


-2.44 


1.18 


0.7 


uncharacterized 
bone marrow protein 
BM039 




265 


NM_002106 


H2AFZ 


-2.49 


1.53 


0.72 


H2A histone family, 
member Z 


Chromosomal 
protein, 

Nucieosome core, 
Nuclear protein, 
DNA-binding, 
Multigene family 


91 


Contig64688 




-2.68 


3.1 


0.73 


hypothetical protein 
FLJ23468 


Hypothetical 
protein 


365 


Cpntig44289_RC 




-1.65 


1.6 


0.67 


ESTs 




315 


Contig28552_RC 




-1.37 


1.53 


0.68 


diaphanous homolog 
3 (Drosophila) 


Hypothetical 
protein, Coiled coil, 
Repeat, Alternative 
splicing 


281 


Contig46218JRC 




-1.31 


1.56 


0.68 


ESTs, Weakly similar to T19201 
hypothetical protein C1 1G6.3 - 
Caenorhabditis elegans [C. elegans] 


321 


Contig28947_RC 




-1.3 


0.98 


0.67 


cell division cycle 
25A 


Hypothetical 
protein, Cell 
division, Mitosis, 
Hydrolase, 
Alternative splicing, 
Multigene family, 
3D-structure 


283 


NMJ316095 


LOC5165 
9 


-1.4 


2.13 


0.67 


HSPC037 protein 


Hypothetical 
protein 


249 


NMJ)03090 


SNRPA1 


-3.26 


0.95 


0.7 


small nuclear 
ribonucleoprotein 
polypeptide A' 


Hypothetical 
protein, Nuclear 
protein, RNA- 
binding, 

Ribonucleoprotein, 
Leucine-rich 
repeat, Repeat, 
3D-structure 


1 1 1 
i i i 
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Accession/ 
Contig No. 


Gene 


Avg 
good 
xdev 


Avg 
poor 
xdev 


Corre- 
lation 


Description 


Sp xref keyword 
Jist 


SEQ ID 


NMJ>02811 


PSMD7 


-2.48 


1.89 


0.7 


proteasome 
(prosome, 
macropain) 26S 
subunit, non- 
ATPase, 7 (Mov34 
homolog) 


Proteasome 


107 


Contig38288_RC 




-2.34 


0.97 


0.67 


hypothetical protein 
DKFZp762A2013 


Hypothetical 
protein 


297 


NMJD03406 


YWHAZ 


-1.5 


2.79 


0.68 


tyrosine 3- 
monooxygenase/tryp 
tophan 5- 
monooxygenase 
activation protein, 
zeta polypeptide 


Brain, Neurone, 
Phosphorylation, 
Acetylation, 
Multigene family, 
3D-structure 


121 


AL1 37540 


NTN4 


2.13 


-4.61 


-0.69 


netrin 4 


Hypothetical 
protein, Laminin 
EGF-like domain, 
Signal 


19 


AL049367 




1.9 


-3.2 


-0.68 


EST 


Transducer, 
Prenylation, 
Lipoprotein, 
Multigene family, 
Acetyiaiion 


15 


NMJ313409 


FST 


1.04 


-5.78 


-0.69 


follistatin 


Glycoprotein, 
Repeat, Signal, 
Alternative splicing 


235 


NMJD00060 


BTD 


3.1 


-1.45 


-0.67 


biotinidase 


Hydrolase, 
Glycoprotein, 
Signal, Disease 
mutation 


37 



Table 4. Geneset of 50 markers used to classify ER+, ER/AGE low, LN+ individuals. 



S Accession/ 


Gene , 


Avg 


Avg 


Correl- 


Description 


Sp xref keyword 


SEQ ID 


I Contig No. 




good 


poor 


ation 




Jist 






xdev 


xdev 
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Accession/ 
Contig No. 


Gene 


Avg 

good 

xdev 


Avg 
poor 
xdev 


Correl- 
ation 


Description 


Sp xref keyword 
Jist 


SEQ ID 


NMJ306417 


MTAP44 


-1.5 


3 


0.69 


Fc fragment of IgG, 
low affinity lib, 
receptor for (CD32) 


Hydrolase, \ 
Hypothetical \ 
protein, 

Immunoglobulin 
domain, IgG- 
binding protein, 
Receptor, 
Transmembrane, 
Glycoprotein, j 
Signal, Repeat, 
Multigene family, 
Polymorphism, 
NAD, One-carbon 
metabolism, Serine 
protease, 
Zymogen, 
Protease, 

Alternative splicing, 

Chromosomal 

translocation, 

Proto-oncogene, 

Galaptin, Lectin, 

Antigen 


205 


NMJD06820 


GS3686 


-4.3 


4.06 


0.69 


chromosome 1 open 
reading frame 29 


Hypothetical 
protein 


213 


IMMJ301548 


IFIT1 


-3.4 


4.27 


0.71 


Interferon-induced 
protein with 
tetratricopeptide 
repeats 1 


Repeat, TPR 
repeat, Interferon 
induction 


75 


Contig41538_RC 




-2.5 


3.16 


0.68 


ESTs, Moderately similar to hypothetical 
protein FLJ20489 [Homo sapiens] 


307 


NMJD16816 


OAS1 


-1.7 


3.29 


0.75 


2',5-oligoadenylate 
synthetase 1 , 
40/46kDa 


RNA-binding, 

Transferase, 

N ucleotidyltransfer 

ase, Interferon 

induction, 

Alternative splicing 


255 


Contig51660_RC 




-2.1 


2.65 


0.66 


28kD interferon 
responsive protein 


Transmembrane 


339 


Contig43645_RC 




-4.8 


1.44 


0.63 


Homo sapiens, 
clone 

IMAGE:4428577, 
mRNA, partial cds 


Hypothetical 
protein 


313 


AF026941 




-4.6 


2.71 


0.63 


EST, Weakly similar 
to 2004399A 
chromosomal 
protein [Homo 
sapiens] 


Hypothetical 
protein 


5 


NM_007315 


STAT1 


-3.5 


1.8 


0.59 


signal transducer 
and activator of 
transcription 1 , 
91kDa 


Transcription 
regulation, DNA- 
binding, Nuclear 
protein, 

Phosphorylation, 
SH2 domain, 
Alternative splicing, 
3D-structure 


225 
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Accession/ 
Contig No. 


Gene 


Avg 

good 

xdev 


Avg 
poor 
xdev 


Correl- 
ation 


Description 


Sp xref keyword 
Jist 


SEQ ID 


NMJD02038 


G1P3 


-4.1 


5.64 


0.79 


interferon, alpha- 
inducible protein 
(clone 1FI-6-16) 


Interferon I 
induction, 

Transmembrane, j 
Signal, Alternative 
splicing 


85 | 


NM_005101 


ISG15 


-5.6 


5.34 


0.77 


interferon-stimuiated 
protein, 15 kDa 


Interferon 
induction, Repeat 


169 


NM_002462 


MX1 


-6.1 


0.83 


0.56 


rnyxovirus (influenza 
virus) resistance 1, 
interferon-inducible 
protein p78 (mouse) 


Hypothetical 
protein, Interferon 
induction, GTP- 
binding, Multigene 
family, Antiviral 


103 


NMJD05532 


IFI27 


-5.8 


2.81 


0.59 


interferon, alpha- 
inducible protein 27 


Interferon 
induction, 
Transmembrane 


183 


NMJDQ2346 


LY6E 


-2.1 


3.58 


0.75 


lymphocyte antigen 
6 complex, locus E 


Signal, Antigen, 
Multigene family, 
Membrane, GPI- 
anchor 


97 


NM_016817 


OAS2 


-3.6 


1.89 


0.59 


2 f -5'-oligoadenylate 
synthetase 2, 
69/71 kDa 


RNA-binding, 

Transferase, 

N ucleotidyltransfer 

ase, Repeat, 

Interferon 

induction, 

Alternative splicing, 
Myristate 


257 


Contig44909_RC 




-2.3 


1.13 


0.55 


hypothetical protein 
BC012330 


Hypothetical 
protein 


317 


NMJ317414 


USP18 


-4.1 


3.37 


j 0.72 


ubiquitin specific 
protease 18 


Ubl conjugation 
pathway, 
Hydrolase, Thiol 
protease, 
Multigene family 


259 


NMJD04029 


IRF7 


-2.4 


3.67 


0.66 


interferon regulatory 
factor 7 


Collagen, 
Transcription 
regulation, DNA- 
binding, Nuclear 
protein, Activator, 
Alternative splicing 


135 


NM_004335 


BST2 


-3.2 


3.22 


0.57 


bone marrow 
stromal cell antigen 
2 


Transmembrane, 
Glycoprotein, 
Signal-anchor, 
Polymorphism 


145 


NM_002759 


PRKR 


-2.4 


1.8 


0.58 


protein kinase, 
interferon-inducible 
double stranded 
RNA dependent 


Transferase, 
Serine/threonine- 
protein kinase, 
ATP-binding, 
Repeat, 

Phosphorylation, 
Interferon 
induction, RNA- 
binding, 3D- 
structure 


105 
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Accession/ 
Contig No. 


Gene 


Avg 

good 

xdev 


Avg 
poor 
xdev 


Correl- 
ation 


Description 


Sp xref keyword 
Jist 


SEQ ID 


NM__006332 


IFI30 


-3.8 


2.65 


0.64 


interferon, gamma- ! 
inducible protein 30 


Oxidoreductase, 

Interferon 

induction, 

Glycoprotein, 

Lysosome, Signal, 

Hypothetical 

protein 


203 


NM_009587 


LGALS9 


-3.2 


2.08 


0.6 


lectin, galactoside- 
binding, soluble, 9 
(galectin 9) 


Galaptin, Lectin, 
Repeat, Alternative 
splicing 


227 


NMJ303641 


IFITM1 


-2.4 


5.54 


0.63 


interferon induced 
transmembrane 
protein 1 (9-27) 


Interferon 
induction, 
Transmembrane 


127 


NMJD17523 


HSXIAPA 
F1 


-1 


2.84 


0.7 


XIAP associated 
factor-1 


Hypothetical 
protein 


261 


NMJ314314 


RIG-I 


-1.3 


3.55 


0.62 


RNA helicase 


ATP-binding, 

Helicase, 

Hydrolase, 

Hypothetical 

protein 


239 


Contig47563_RC 




-2.2 


3.11 


0.56 


ESTs 




325 


AI497657JRC 




-4.4 


5.61 


0.74 


guanine nucleotide 
binding protein 4 


Transducer, 
Prenylation, 
Lipoprotein, 
Multigene family 


335 


NM_000735 


CGA 


-4.3 


2.5 


0.58 


glycoprotein 
hormones, alpha 
polypeptide 


Hormone, 
Glycoprotein, 
Signal, 3D- 
structure 


53 


NM_004988 


MAGEA1 


-1.4 


6.31 


0.64 


melanoma antigen, 
family A, 1 (directs 
expression of 
antigen MZ2-E) 


Antigen, Multigene 
family, 

Polymorphism, 
Tumor antigen 


163 


Contig54242_RC 




-1.2 


4.1 


0.65 


chromosome 17 
open reading frame 
26 


Hypothetical 
protein 


347 


NM_004710 


SYNGR2 


-1.4 


3.01 


0.54 


synaptogyrin 2 


Transmembrane 


157 


NM_001168 


BiRC5 


-3.7 


3.39 


0.64 


baculoviral IAP 
repeat-containing 5 
(survivin) 


Hypothetical 
protein, Protease, 
Receptor, 
Apoptosis, Thiol 
protease inhibitor, 
Alternative splicing, 
3D-structure 


63 


Contig41413_RC 




-4.4 


2.61 


0.57 


ribonucleotide 
reductase M2 
polypeptide 


Oxidoreductase, 
DNA replication, 
Iron 


305 
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Accession/ 
Contig No. 


Gene 


Avg 

good 

xdev 


Avg 
poor 
xdev 


Correl- 
ation 


Description 


Sp xref keyword 
Jist 


SEQ ID 


NM_004203 


PKMYT1 


-3.4 


3.79 


0.6 


retinoblastoma-like 2 
( P 130) 


ATP-binding, 
Kinase, 

Serine/threonine- 
protein kinase, 
Transferase, 
Transcription 
regulation, DNA- 
binding, Nuclear 
protein, Cell cycle, 
Phosphorylation, 
Anti-oncogene 


137 


Contig48913_RC 




-3.1 


1.72 


0.55 


Homo sapiens, Similar to hypothetical 
protein PR01722, clone MGC: 15692 
IMAGE:3351479, mRNA, complete cds 


327 


NMJD05804 


DDXL 


-2.5 


1.42 


0.58 


DEAD/H (Asp-Glu- 
Ala-Asp/His) box 
polypeptide 39 


ATP-binding, 

Helicase, 

Hydrolase, 

Hypothetical 

protein 


191 


NMJ) 16359 


LOC5120 
3 


-1.7 


3.6 


0.57 


nucleolar protein 
ANKT 


Hypothetical 
protein, Nuclear 
protein 


253 


NMJ301645 


APOC1 


-2.9 


3.43 


0.58 


apolipoprotein C-l 


Plasma, Lipid 
transport, VLDL, 
Signal, 3D- 
structure, 
Polymorphism 


79 


Contig37895_RC 




-2 


2.05 


0.55 


ESTs 




295 


NM_005749 


TOB1 


-1.3 


4.96 


0.59 


transducer of 
ERBB2, 1 


Phosphorylation 


189 


NMJ300269 


NME1 


-1.3 


2.98 


0.55 


non-metastatic cells 
1, protein (NM23A) 
expressed in 


Transferase, 
Kinase, ATP- 
binding, Nuclear 
protein, Anti- 
oncogene, Disease 
mutation 


39 


NM_014462 


LSM1 


-1 


4.5 


0.57 


Lsm1 protein 


Nuclear protein, 
Ribonucleoprotein, 
mRNA splicing, 
mRNA processing, 
RNA-binding 


245 


Contig31221_RC 




-1.4 


3.83 


0.56 


HTPAP protein 




287 


NM_005326 


HAGH 


-1.9 


4.29 


0.57 


hydroxyacyl 

glutathione 

hydrolase 


Hydrolase, Zinc, 
3D-structure 


179 


Contig42342_RC 




0.78 


-3.2 


-0.6 


Homo sapiens cDNA 
FLJ39417 fis, clone 
PLACE6016942 


Hypothetical 
protein 


311 


AL1 37540 


NTN4 


2.24 


-3.9 


-0.6 


netrin 4 


Laminin EGF-like 
domain, Signal, 
Hypothetical 
protein \ 


19 


Contig40434_RC 




1.64 


-5.6 


-0.6 


wingless-type MMTV 
integration site 
family, member 5A 


Developmental 
protein, 
Glycoprotein, 
Signal 


301 
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Accession/ 
Contig No. 


Gene 


Avg 

good 

xdev 


Avg 
poor 
xdev 


Correl- 
ation 


Description 


Sp xref keyword 
Jist 


SEQ ID 


Contig1632_RC 




1.03 


-3.9 


-0.6 


hypothetical protein 
MGC17921 


Hypothetical 
protein 


275 


NM_014246 


CELSR1 


0.95 


-4.6 


-0.6 


cadherin, EGF LAG 
seven-pass G-type 
receptor 1 (flamingo 
homolog, 
Drosophila) 


G-protein coupled 
receptor, 
Transmembrane, 
Glycoprotein, EGF- 
like domain, 
Calcium-binding, 
Laminin EGF-like 
domain, Repeat, 
Developmental 
protein, 
Hydroxylation, 
Signal, Alternative 
splicing, 
Hypothetical 
protein 


237 


NM_005139 


ANXA3 


1.26 


-6.2 


-0.6 


annexin A3 


Annexin, 

Calcium/phospholi 

pid-binding, 

Repeat, 

Phospholipase A2 
inhibitor, 3D- 
structure, 
Polymorphism 


171 



Table 5. Geneset of 65 markers used to classify ER+ ? ER/AGE low, LN individuals. 



Accession/ 
Contig No. 


Gene 


Avg 

good 

xdev 


Avg 
poor 
xdev 


Correl- 
ation 


Sequence 
name 


Description 


Sp xref keywo 
rdjist 


SEQ ID 


M55914 


MPB1 


-2.82 


1.25 


0.5 


EN01 


enolase 1 , (alpha) 


DNA-binding, 

Transcription 

regulation, 

Repressor, 

Nuclear protein, 

Lyase, 

Glycolysis, 

Magnesium, 

Multigene 

family, 

Hypothetical 

protein 


31 


NMJ305945 


MPB1 


-3.06 


1.19 


0.49 


EN01 


Homo sapiens 
enolase 1, (alpha) 
(EN01), mRNA. 


Glycolysis, 

Hypothetical 

protein, Lyase, 

Magnesium, 

DNA-binding, 

Transcription 

regulation, 

Repressor, 

Nuclear protein, 

Multigene family 


193 
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Accession/ 
Contig No. 


Gene 


Avg 

good 

xdev 


Avg 
poor 
xdev 


Correl- 
ation 


• Sequence 
name 


Description 


Sp xref keywo 
rdjist 


SEQ ID 


NMJ)01428 


ENOI 


-2.52 


S 1.18 


0.46 


I EN01 


enolase 1 , (alpha) 


DNA-binding, 

Transcription 

regulation, 

Repressor, 

Nuclear protein, 

Lyase, 

Glycolysis, 

Magnesium, 

Multigene 

family, 

Hypothetical 

protein 


71 


NM_001216 


CA9 


-4.72 


1.49 


0.6 


CA9 


carbonic 
anhydrase IX 


Lyase, Zinc, 
Transmembrane 
, Glycoprotein, 
Antigen, Signal, 
Nuclear protein, 
Polymorphism 


65 


NM_001124 


ADM 


-5.68 


2.99 


0.56 


ADM 


Adrenomedullin 


Hormone, 
Amidation, 
Cleavage on 
pair of basic 
residues, Signal 


61 


NM_000584 


IL8 


-2.45 


2.04 


0.54 


IL8 


interleukin 8 


Cytokine, 

Chemotaxis, 

Inflammatory 

response, 

Signal, 

Alternative 

splicing, 3D- 

structure 


49 


D25328 


PFKP 


-4.19 


3.29 


0.56 


PFKP 


Phosphofructo- 
kinase, platelet 


Kinase, j 

Transferase, 

Glycolysis, 

Repeat, 

Allosteric 

enzyme, 

Phosphorylation, 

Magnesium, 

Multigene family 


25 


NM_006096 


NDRG1 


-5.45 


5.97 


0.77 


NDRG1 


N-myc downstream 
regulated gene 1 


Hypothetical 
protein, Nuclear 
protein, Repeat 


199 


NMJD04994 


MMP9 


-5.53 


1.07 


0.49 


MMP9 


matrix 

metalloproteinase 
9 (gelatinase B, 
92kDa gelatinase, 
y^Kua type iv 
sollagenase) 


Hydrolase, 

Metalloprotease, 

Glycoprotein, 

Zinc, Zymogen, 

Calcium, 

Collagen 

degradation, 

Extracellular 

matrix, Repeat, 

Signal, 

Polymorphism, 
3D-structure 


165 
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Accession/ 
Contig No. 


Gene 


Avg 
good 
xdev 


Avg 
poor 
xdev 


Correl- 
ation 


Sequence 
name 


Description 


Sp xref keywo 
rdjist 


SEQ ID 


NM_003311 


TSSC3 


-4.57 


5.58 


0.68 


TSSC3 


tumor suppressing subtransferable 
candidate 3 


117 


NMJD06086 


TUBB4 


-5.19 


2.85 


0.59 


TUBB4 


tubulin, beta, 4 


G-protein 

coupled 

receptor, 

Transmembrane 

, Glycoprotein, 

Phosphorylation, 

Lipoprotein, 

Palmitate, 

Polymorphism, 

Hypothetical 

protein, GTP- 

binding, 

Receptor, 

Microtubules, 

Multigene family 


! 197 


NM_006115 


PRAME 


-4.48 


2.77 


0.61 


PRAME 


preferentially 
expressed antigen 
in melanoma 


Antigen 


201 


NMJ304345 


CAMP 


-2.02 


1.37 


0.49 


CAMP 


cathelicidin 

antimicrobial 

peptide 


Antibiotic, Signal 


149 


NMJD 18455 


BM039 


-2.34 


0.76 


0.47 


BM039 


uncharacterized 
bone marrow 
protein BM039 




265 


Contig49169_RC 




-1.17 


1.5 


0.46 


SUV39H2 


suppressor of 
variegation 3-9 
(Drosophila) 
lomolog 2; 
hypothetical protein 
FLJ23414 


Hypothetical 
protein, Nuclear 
protein 


329 


Contig45032_RC 




-1.37 


0.77 


0.45 


FLJ14813 


hypothetical protein 
FLJ14813 


Hypothetical 
protein, ATP- 
binding, Kinase, 
Serine/threonine 
-protein kinase, 
Transferase 


319 


NM_000917 


P4HA1 


-1.54 


4.31 


0.62 


P4HA1 


procollagen- 
proline, 2- 
oxoglutarate 4- 
dioxygenase 
proline 4- 

hydroxylase), alpha 
polypeptide I 


Dioxygenase, 

Collagen, 

Oxidoreductase, 

Iron, Vitamin C, 

Alternative 

splicing, 

Glycoprotein, 

Endoplasmic 

reticulum, Signal 


57 


NMJ302046 


GAPD 


-2.51 


3.42 


0.6 


GAPD 


glyceraldehyde-3- 

Dhosphate 

dehydrogenase 


Glycolysis, NAD, 
Oxidoreductase, 
Hypothetical 
Drotein, 

Vlultigene family 


87 
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NM_000365 


TPI1 


-1.81 


2.94 


0.56 


TPI1 


triosephosphate 
isomerase 1 


Fatty acid 

biosynthesis, 

Gluconeogenesi 

s, Glycolysis, 

Isomerase, 

Pentose shunt, 

Disease 

mutation, 

Polymorphism, 

3D-structure 


45 


NMJD14364 


GAPDS 


-1.08 


2.88 


0.58 


GAPDS 


glyceraidehyde-3- 
phosphate 
dehydrogenase, 
testis-specific 


Glycolysis, 

Oxidoreductase, 

NAD 


243 


NM_005566 


LDHA 


-2.01 


4.01 


0.59 


LDHA 


lactate 

dehydrogenase A 


Oxidoreductase, 
NAD, Glycolysis, 
Multigene 
family, Disease 
mutation, 
Polymorphism 


| 185 


NM__000291 


PGK1 


-2.28 


1.68 


0.51 


PGK1 


phosphoglycerate 
kinase 1 


Kinase, 

Transferase, 

Multigene 

family, 

Glycolysis, 

Acetylation, 

Disease 

mutation, 

Polymorphism, 

Hereditary 

hemolytic 

anemia 


41 


NM__016185 


LOC51 1 
55 


-2.33 


2.82 


0.59 


HN1 


hematological and 
neurological 
expressed 1 




251 


NMJ301168 


BIRC5 


-4.33 


2.78 


0.55 


BIRC5 


baculoviral IAP 
repeat-containing 5 
(survivin) 


Apoptosis, Thiol 

protease 

inhibitor, 

Alternative 

splicing, 3D- 

structure, 

Hypothetical 

Drotein, 

Protease, 

Receptor 


63 


NMJD02266 


KPNA2 


-3.75 


1.34 


0.47 


KPNA2 


karyopherin alpha 
2 (RAG cohort 1, 
mportin alpha 1) 1 


Transport, ! 

3 rotein 

ransport, 

Repeat Nuclear 

□rotein, 

D olymorphism 


95 


Contig31288_RC 




-2.1 


1.27 


0.5 




ESTs, Weakly similar to hypothetical 
orotein FLJ20489 [Homo sapiens] 
H. sapiens] 


289 
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NM_000269 


NME1 


-2.15 


3.43 


0.55 


NME1 


non-metastatic 
cells 1 , protein 
(NM23A) 
expressed in 


Transferase, 
Kinase, ATP- 
binding, Nuclear 
protein, Anti- 
oncogene, 
Disease 
mutation 


39 


My nn'j'ifiQ 
I\JlVl_UUo1 do 


o I Ko 


-1 .23 


1 .73 


0.45 


STK6 


serine/threonine 
kinase 6 


ATP-binding, 
Kinase, 

Serine/threonine 
-protein kinase, 
Transferase 


113 


MM DDYOVA 




- I.OO 




U.ol 


|— > A /— > 1 1 


brain acyl-CoA 
hydrolase 


Hydrolase, 
Serine esterase, 
Repeat 


223 


Contig55188_RC 




-2.36 


3.28 


0.47 


FLJ22341 


hypothetical protein 
FLJ22341 


Hypothetical 
protein 


351 


INIVl UU^Ubl 


GCLM 


-1.06 


1.76 


0.48 


GCLM 


glutamate-cysteine 
ligase, modifier 
subunit 


Ligase, 

Glutathione 

biosynthesis 


89 


NMJD04207 


SLC16A 
3 


-3.11 


5.07 


0.67 


SLC16A3 


solute carrier family 
16 (monocarboxylic 
acid transporters), 
member 3 


Transport, 
Symport, 
Transmembrane 
, Multigene 
family 


139 


NMJ)00582 


SPP1 


-5.09 


5.47 


0.53 


SPP1 


secreted 

phosphoprotein 1 
(osteopontin, bone 
sialoprotein I, early 
T-lymphocyte 
activation 1) 


Hypothetical 

protein, 

Glycoprotein, 

Sialic acid, 

Biomineralizatio 

n, Cell adhesion, 

Phosphorylation, 

Signal, 

Alternative 

splicing 


47 


NMJ301109 


ADAM8 


-2.5 


3.74 


0.45 


ADAM8 


a disintegrin and 
metalloproteinase 
domain 8 


Hydrolase, 
Metailoprotease, 
Zinc, Signal, 
Glycoprotein, 
Transmembrane 
, Antigen 


59 


D50402 


SLC11A 

A 
I 


-1.05 


3.46 


0.53 


SLC11A1 


solute carrier family 
11 (proton-coupled 
divalent metal ion 
transporters), 
member 1 


Transport, Iron 

transport, 

Transmembrane 

, Glycoprotein, 

Macrophage, 

Polymorphism 


27 


AL080235 


DKFZP5 

86E162 

1 


-1.23 


1.96 


0.51 


RIS1 


Ras-induced 
senescence 1 


Hypothetical 

nrntpin 


17 


Contig40552_RC 




-1.26 


3.96 


0.54 


FLJ25348 


hypothetical protein 
FLJ25348 


Hypothetical 
protein 


303 


Contig52490_RC 




-0.64 


3.33 


0.61 


LOC 11623 
8 


hypothetical protein 
BC014072 




341 


NM_006461 


DEEPE 
ST 


-2.1 


1.85 


0.46 


SPAG5 


sperm associated 
antigen 5 


Hypothetical 
protein 


207 
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Contig56503_RC 




-4.3 


3.39 


0.55 


MGC9753 


hypothetical gene 
MGC9753 


Hypothetical 
protein 


355 


Contig63525 




-1.91 


3.34 


0.5 


FLJ 13352 


hypothetical protein 
FLJ 13352 


Hypothetical 
protein 


363 


NM_001909 


CTSD 


-0.83 


4.6 


0.51 


CTSD 


cathepsin D 
(lysosomal aspartyl 
protease) 


Hydrolase, 

Aspartyl 

protease, 

Glycoprotein, 

Lysosome, 

Signal, 

Zymogen, 

Polymorphism, 

Alzheimer's 

disease, 3D- 

structure 


83 


NM_005063 


SCD 


-2.57 


5.15 


0.48 


SCD 


stearoyl-CoA 
desaturase (delta- 
9-desaturase) 


Hypothetical 
protein, 
Endoplasmic 
reticulum, Fatty 
acid 

biosynthesis, 
Iron, 

Oxidoreductase, 
Transmembrane 


167 


NM_005165 


ALDOC 


-2.43 


5.02 


0.48 


ALDOC 


aldolase C, 

fructose- 

bisphosphate 


Lyase, Schiff 
base, 

Glycolysis, 
Multigene family 


173 


NMJD00363 


TNNI3 


-0.54 


3.58 


0.48 


TNN13 


troponin I, cardiac 


Hypothetical 

protein, Muscle 

protein, Actin- 

binding, 

Acetylation, 

Disease 

mutation, 

Cardiomyopathy 

, Receptor, 

Signal 


43 






-1 .Do 




n /it 


rAUol 


COT 

EST 


Heme, 

Hypothetical 

protein 


7 


Contia30875 RC 




-0.88 












Zoo 


NMJ318487 


HCA112 


-0.7 


3.54 


0.58 


HCA112 


hepatocellular 
carcinoma- 
associated antigen 
112 


Hypothetical 
protein 


269 


IMM_001323 


CST6 


-1.63 


3.84 


0.57 


CST6 


cystatin E/M 


Thiol protease 
inhibitor, Signal, 
Glycoprotein 


69 ; 
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NM_ 006516 


SLC2A1 


-1.66 


2.22 


0.46 


SLC2A1 


solute carrier family 
2 (facilitated 
glucose 
transporter), 
member 1 


Transmembrane 
, Sugar 
transport, 
Transport, 
Glycoprotein, 
Multigene 
family, Disease 
mutation 


209 


MM nA70C7 


I a 1/ » n 

LAK-4P 


-1.04 


3.28 


0.61 


EVIN1 


expressed in 
activated T/LAK 
lymphocytes 


Hypothetical 
protein 


221 


NMJD04710 


SYNGR 
2 


-0.84 


4.81 


0.56 


SYNGR2 


synaptogyrin 2 


Transmembrane 


157 


Contig63649_RC 




-1.34 


6.3 


0.75 




ESTs, Weakly simil, 
chromosomal protei 
sapiens] [H.sapiens 


ar to 2004399A 
n [Homo 


361 


NM_003376 


VEGF 


-2.12 


2.42 


0.46 


VEGF 


vascular 

endothelial growth 
factor 


Hypothetical 
protein, Mitogen, 
Angiogenesis, 
Growth factor, 
Glycoprotein, 
Signal, Heparin- 
binding, 
Alternative 
splicing, 
Multigene 
family, 3D- 
structure 


119 


NMJD00799 


EPO 


-0.75 


4.01 


0.69 


EPO 


erythropoietin 


Erythrocyte 

maturation, 

Glycoprotein, 

Hormone, 

Signal, 

Pharmaceutical, 
3D-structure 


55 


NMJ306014 


DXS987 
9E 


-1.85 


3.44 


0.54 


DXS9879E 


DNA segment on chromosome X 
(unique) 9879 expressed sequence 


195 


NM_007183 


PKP3 


-0.91 


4.14 


0.48 


PKP3 


plakophilin 3 


Cell adhesion, 
Cytoskeleton, 
Structural 
protein, Nuclear 
protein, Repeat 


219 


D 13642 


SF3B3 


-0.65 


2.28 


0.48 


SF3B3 


splicing factor 3b, 
subunit 3, 130kDa 


Hypothetical j 

protein, 

Spliceosome, 

mRNA 

processing, 

nRNA splicing, 

Nuclear protein 


23 


NM_003756 


EIF3S3 


-1.85 


2.19 


0.46 


EIF3S3 

1 
1 


eukaryotic 
translation initiation 
factor 3, subunit 3 
gamma, 40kDa 


nitiation factor, 

3 rotein 

oiosynthesis 


129 
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Contig47096_RC 




-0.41 


4.52 


0.54 


PFKFB4 


6-phosphofructo-2- 
kinase/fructose- 
2,6-biphosphatase 
4 


Kinase, 

Multifunctional 
enzyme, 
Transferase, 
Hydrolase, ATP- 
binding, 

Phosphorylation, 
Muitigene family 


323 


NMJ304209 


SYNGR 
3 


-0.31 


3.67 


0.53 


SYNGR3 


synaptogyrin 3 


Transmembrane 


141 


Contig3464_RC 




0.99 


-5.81 


-0.52 




ESTs 




277 


Contig31646_RC 




1.1 


-7.76 


-0.5 


COL14A1 


collagen, type XIV, 
alpha 1 (undulin) 


Extracellular 
matrix, 

Glycoprotein, 

Hypothetical 

protein, 

Collagen, Signal 


291 


Contig49388_RC 




1.73 


-1.75 


-0.51 


FLJ 13322 


hypothetical protein 
FLJ 13322 


Hypothetical 
protein 


331 


Contig41887_RC 




0.37 


-5.74 


-0.47 


LOC12422 
0 


similar to common 
salivary protein 1 


Hypothetical 
protein 


309 : 



[00104] 



5.4 DIAGNOSTIC AND PROGNOSTIC METHODS 
5.4.1 SAMPLE COLLECTION 
[00105] In the present invention, markers, such as target polynucleotide molecules or 
proteins, are extracted from a sample taken from an individual afflicted with a condition such 
as breast cancer. The sample may be collected in any clinically acceptable manner, but must 
be collected such that marker-derived polynucleotides (i.e., RNA) are preserved (if gene 
expression is to be measured) or proteins are preserved (if encoded proteins are to be 
measured). For example, mRNA or nucleic acids derived therefrom (i.e., cDNA or amplified 
DNA) are preferably labeled distinguishably from standard or control polynucleotide 
molecules, and both are simultaneously or independently hybridized to a microarray 
comprising some or all of the markers or marker sets or subsets described above. 
Alternatively, mRNA or nucleic acids derived therefrom may be labeled with the same label 
as the standard or control polynucleotide molecules, wherein the intensity of hybridization of 
each at a particular probe is compared. A sample may comprise any clinically relevant tissue 
sample, such as a tumor biopsy or fine needle aspirate, or a sample of bodily fluid, such as 
blood, plasma, serum, lymph, ascitic fluid, cystic fluid, urine or nipple exudate. The sample 
may be taken from a human, or, in a veterinary context, from non-human animals such as 
ruminants, horses, swine or sheep, or from domestic companion animals such as felines and 
canines. 
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[00106] Methods for preparing total and poly(A)+ RNA are well known and are described 
generally in Sambrook et ah, Molecular Cloning - A Laboratory Manual (2nd Ed.), 
Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1989)) and 
Ausubel et aL, Current Protocols in Molecular Biology, vol. 2, Current Protocols 
Publishing, New York (1994)). 

[00107] RNA may be isolated from eukaryotic cells by procedures that involve lysis of the 
cells and denaturation of the proteins contained therein. Cells of interest include wild-type 
cells (i.e., non-cancerous), drug-exposed wild-type cells, tumor- or tumor-derived cells, 
modified cells, normal or tumor cell line cells, and drug-exposed modified cells. Preferably, 
the cells are breast cancer tumor cells. 

[00108] Additional steps may be employed to remove DNA. Cell lysis may be 
accomplished with a nonionic detergent, followed by microcentrifugation to remove the 
nuclei and hence the bulk of the cellular DNA. In one embodiment, RNA is extracted from 
cells of the various types of interest using guanidinium thiocyanate lysis followed by CsCl 
centrifugation to separate the RNA from DNA (Chirgwin et aL 9 Biochemistry 18:5294-5299 
(1979)). Poly(A)+ RNA is selected by selection with oligo-dT cellulose (see Sambrook et 
ah, Molecular Cloning - A Laboratory Manual (2nd Ed.), Vols. 1-3, Cold Spring 
Harbor Laboratory, Cold Spring Harbor, New York (1989). Alternatively, separation of 
RNA from DNA can be accomplished by organic extraction, for example, with hot phenol or 
phenol/chloroform/isoamyl alcohol. 

[00109] If desired, RNase inhibitors may be added to the lysis buffer. Likewise, for certain 
cell types, it may be desirable to add a protein denaturation/digestion step to the protocol. 
[001 10] For many applications, it is desirable to preferentially enrich mRNA with respect to 
other cellular RNAs, such as transfer RNA (tRNA) and ribosomal RNA (rRNA). Most 
mRNAs contain a poly(A) tail at their 3' end. This allows them to be enriched by affinity 
chromatography, for example, using oligo(dT) or poly(U) coupled to a solid support, such as 
cellulose or Sephadex™ (see Ausubel et al., Current Protocols in Molecular jBiOLOGY, 
vol. 2, Current Protocols Publishing, New York (1994). Once bound, poly(A)4- mRNA is 
eluted from the affinity column using 2 mM EDTA/0.1% SDS. 

[001 1 1] The sample of RNA can comprise a plurality of different mRNA molecules, each 
different mRNA molecule having a different nucleotide sequence. In a specific embodiment, 
the mRNA molecules in the RNA sample comprise at least 5, 10, 15, 20, 25, 30, 40 or 50 
different nucleotide sequences. More preferably, the mRNA molecules of the RNA sample 
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comprise mRNA molecules corresponding to each of the marker genes. In another specific 
embodiment, the RNA sample is a mammalian RNA sample. 

[001 12] In a specific embodiment, total RNA or mRNA from cells are used in the methods of 
the invention. The source of the RNA can be cells of a plant or animal, human, mammal, 
primate, non-human animal, dog, cat, mouse, rat, bird, yeast, eukaryote, prokaryote, etc. In 
specific embodiments, the method of the invention is used with a sample containing total 
mRNA or total RNA from 1 x 10 6 cells or less. In another embodiment, proteins can be 
isolated from the foregoing sources, by methods known in the art, for use in expression 
analysis at the protein level. 

[001 13] Probes to the homologs of the marker sequences disclosed herein can be employed 
preferably when non-human nucleic acid is being assayed. 

[001 14] The methods of the invention may employ any molecule suitable as a marker. For 
example, sets of proteins informative for a particular condition, including a disease, may be 
determined. As for gene-based markers, levels of variations of different proteins in samples 
may be determined for phenotypic or genotypic subsets of the condition, and proteins 
showing significant variation in either level (abundance) or activity, or both, may be 
identified in order to create a set of proteins informative for one or more of these subsets. 
Such proteins may be identified, for example, by use of gel electrophoresis, such as one- 
dimensional polyacrylamide gel electrophoresis, two-dimensional polyacrylamide gel 
electrophoresis, nondenaturing polyacrylamide gel electrophoresis; isoelectric focusing gels, 
etc., by use of antibody arrays, etc. Of course, the particular template(s) used to classify the 
individual depends upon the type(s) of cellular constituents used as markers. For example, 
where nucleic acids {e.g., genes or nucleic acids derived from expressed genes) are used as 
markers, the template comprises nucleic acids (or the level of expression or abundance 
thereof); where proteins are used as markers, the template comprises proteins, for example, 
the level or abundance of those proteins in a set of individuals; etc. 

5.4.2 USE OF PROGNOSTIC GENESETS FOR BREAST CANCER 
[001 15] According to the present invention, once genesets informative for a plurality of 
subsets of a condition are identified, an individual is classified into one of these subsets and a 
prognosis is made based on the expression of the genes, or their encoded proteins, in the 
geneset for that subset in a breast cancer tumor sample taken from the individual. 
[001 16] For example, a particular hypothetical condition has three relevant phenotypic 
characteristics, A, B and C. In this example, based on these characteristics, genesets 
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informative for prognosis of four patient subsets A + B + ; A + B~C + ; A + B~C"; and A~ are 
identified by the method described above. Thus, an individual having the condition would 
first be classified according to phenotypes A-C into one of the four patient subsets. In one 
embodiment, therefore, the invention provides for the classification of an individual having a 
condition into one of a plurality of patient subsets, wherein a set of genes informative for 
prognosis for the subset has been identified. A sample is then taken from the individual, and 
the expression of the prognostically-informative genes in the sample is analyzed and 
compared to a control. In various embodiments, the control is the average expression of 
informative genes in a pool of samples taken from good prognosis individuals classifiable 
into that patient subset; the average expression of informative genes in a pool of samples 
taken from poor prognosis individuals classifiable into that patient subset; a set of 
mathematical values that represent gene expression levels of good prognosis individuals 
classifiable into that patient subset; etc. 

[001 17] In another embodiment, a sample is taken from the individual, and the levels of 
expression of the prognostically-informative genes in the sample is analyzed. In one 
embodiment, the expression level of each gene can be compared to the expression level of the 
corresponding gene in a control of reference sample to determine a differential expression 
level. The expression profile comprising expression levels or differential expression levels of 
the plurality of genes is then compared to a template profile. In various embodiments, the 
template profile is a good prognosis template comprising the average expression of 
informative genes in samples taken from good prognosis individuals classifiable into that 
patient subset; or a poor prognosis template comprising the average expression of informative 
genes in samples taken from poor prognosis individuals classifiable into that patient subset; 
or a good prognosis profile comprising a set of mathematical values that represent gene 
expression levels of good prognosis individuals classifiable into that patient subset; etc. 
[001 18] In a specific embodiment, the condition is breast cancer, and the phenotypic, 
genotypic and/or clinical classes are: ER~ BRCA1 individuals; ER~~, sporadic individuals; 
ER+, ER/AGE high individuals; ER+, ER/AGE low, LN+ individuals; and ER+, ER/AGE 
low, LN~ individuals. In this embodiment, an individual may be classified as ER+ or ER _ . 
If the individual is ER~, the individual is additionally classified as having a BRCA1 -type or 
sporadic tumor. ER~ individuals are thus classified as ER~ BRCA1 or ER~ sporadic. 
Alternatively, if the individual is classified as ER+, the individual is additionally classified as 
having a high or low ratio of the log (ratio) of the level of expression of the gene encoding 
the estrogen receptor to the individual's age. Individuals having a low ratio are additionally 
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classified as LN+ or LN- ER+ individuals are thus classified as ER+, ER/AGE high; ER+, 
ER/AGE low, LN+, or ER+, ER/AGE low, LN~ Of course, the individual's ER status, 
tumor type, age and LN status may be identified in any order, as long as the individual is 
classified into one of these five subsets. 

[001 19] Thus, in one embodiment, the invention provides a method of classifying an 
individual with a condition as having a good prognosis or a poor prognosis, comprising: (a) 
classifying said individual into one of a plurality of patient classes, said patient classes being 
differentiated by one or more phenotypic, genotypic or clinical characteristics of said 
condition; (b) determining the level of expression of a plurality of genes or their encoded 
proteins in a cell sample taken from the individual relative to a control, said plurality of genes 
or their encoded proteins comprising genes or their encoded proteins in a cell sample taken 
from the individual relative to a control, said plurality of genes or their encoded proteins 
comprising genes or their encoded proteins informative for prognosis of the patient class into 
which said individual is classified; and (c) classifying said individual as having a good 
prognosis or a poor prognosis on the basis of said level of expression. In a specific 
embodiment, said condition is breast cancer, said good prognosis is the non-occurrence of 
metastases within five years of initial diagnosis, and said poor prognosis is the occurrence of 
metastases within five years of initial diagnosis. In an more specific embodiment, said 
classifying said individual with a condition as having a good prognosis or a poor prognosis is 
carried out by comparing the level expression of each of said plurality of genes or their 
encoded proteins to said average level of expression of each corresponding gene or its 
encoded protein in said control, and classifying said individual as having a good prognosis 
poor prognosis if said level of expression correlates with said average level of expression of 
each of said genes or their encoded proteins in a good prognosis control or a poor prognosis 
control, respectively, more strongly than would be expected by chance. In a more specific 
embodiment of the method, said plurality of patient subsets comprises ER~~, BRCA1 
individuals; ER", sporadic individuals; ER+, ER/AGE high individuals; ER+, ER/AGE low, 
LN+ individuals; and ER+, ER/AGE low, LN~ individuals. In another embodiment, said 
control is the average level of expression of each of said plurality of genes informative for 
prognosis in a pool of tumor samples from individuals classified into said subset who have a 
good prognosis or good outcome, or who have a poor prognosis or good outcome. In another 
specific embodiment, said control is a set of mathematical values representing the average 
level of expression of genes informative for prognosis in tumor samples of individuals 
classifiable into said subset who have a good prognosis, or who have a poor prognosis. 
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[00120] It is evident that the different patient subsets described herein reflect different 
molecular mechanisms of the initiation of tumor formation and metastasis. Thus, the 
genesets listed in tables 1-5 are also useful for diagnosing a person as having a particular type 
of breast cancer in the first instance. Thus, the invention also provides a method of 
diagnosing an individual as having a particular subtype of breast cancer, comprising 
determining the level of expression in a sample from said individual of a plurality of the 
genes for which markers are listed in Tables 1-5; and comparing said expression to a control, 
where said control is representative of the expression of said plurality of genes in a breast 
cancer sample of said subtype of cancer, and on the basis of said comparison, diagnosing the 
individual as having said subtype of breast cancer. In a specific embodiment, said subtype of 
cancer is selected from the group consisting of ER~, BRCA1 type; ER~, sporadic type; ER+, 
ER/AGE high type; ER+, ER/AGE low, LN+ type; and ER/AGE low, LN~ type. In another 
specific embodiment, said control is the average level of expression of a plurality of the genes 
for which markers are listed in Table 1, Table 2, Table 3, Table 4 or Table 5. In another 
specific example, said comparing comprises determining the similarity of the expression of 
the genes for which markers are listed in each of Tables 1-5 in said sample taken from said 
individual to a control level of expression of the same genes for each of Tables 1-5, and 
determining whether the level of expression of said genes in said sample is most similar to 
said control expression of the genes for which markers are listed in Table 1, Table 2, Table 3, 
Table 4 or Table 5. 

[00121] In another embodiment, the invention provides a method of classifying an individual 
as having a good prognosis or a poor prognosis, comprising: (a) classifying said individual as 
ER~ BRCA1; ER~ sporadic; ER+, ER/AGE high; ER+, ER/AGE low, LN+; or ER+, 
ER/AGE low, LN~; (b) determining the level of expression of a first plurality of genes in a 
cell sample taken from the individual relative to a control, said first plurality of genes 
comprising two of the genes corresponding to the markers Table 1 if said individual is 
classified as ER"~, BRCA1; Table 2 if said individual is classified as ER~, sporadic; Table 3 if 
said individual is classified as ER+, ER/AGE high; Table 4 if said individual is classified as 
ER+, ER/AGE low, LN+; or Table 5 if said individual is classified as ER+, ER/AGE low, 
LINT, wherein said individual is "ER/AGE high" if the ratio of ER expression to age exceeds 
a predetermined value, and "ER/AGE low" if the ratio of ER expression to age does not 
exceed said predetermined value. In a specific embodiment of this method, said 
predetermined value of ER calculated as ER = 0.1 (AGE - 42.5), wherein AGE is the age of 
said individual. In another specific embodiment, said individual is ER"", BRCA1, and said 
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plurality of genes comprises (i.e., contains at least) 1, 2, 3, 4, 5, 10 or all of the genes for 
which markers are listed in Table 1. In another specific embodiment, said individual is ER~~, 
sporadic, and said plurality of genes comprises (i.e., contains at least) 1, 2, 3, 4, 5, 10 or all of 
the genes for which markers are listed in Table 2. In another specific embodiment, said 
individual is ER+, ER/AGE high, and said plurality of genes comprises (i.e., contains at least) 
1, 2, 3, 4, 5, 10 or all of the genes for which markers are listed in Table 3. In another specific 
embodiment, said individual is ER+, ER/AGE low, LN+, and said plurality of genes 
comprises (i.e., contains at least) 1, 2, 3, 4, 5, 10 or all of the genes for which markers are 
listed in Table 4. In another specific embodiment, said individual is ER+, ER/AGE low, 
LN~, and said plurality of genes comprises (i.e., contains at least) 1, 2, 3, 4, 5, 10 or all of the 
genes for which markers are listed in Table 5. In another specific embodiment, the method 
additionally comprises determining in said cell sample the level of expression, relative to a 
control, of a second plurality of genes for which markers are not found in Tables 1-5, wherein 
said second plurality of genes is informative for prognosis. 

[00122] In one embodiment, the invention provides a method of classifying an individual 
with a condition as having a good prognosis or a poor prognosis, comprising: (a) classifying 
said individual into one of a plurality of patient classes, said patient classes being 
differentiated by one or more phenotypic, genotypic or clinical characteristics of said 
condition; (b) determining the levels of expression of a plurality of genes or their encoded 
proteins in a cell sample taken from the individual, optionally relative to a control, said 
plurality of genes or their encoded proteins comprising genes or their encoded proteins 
informative for prognosis of the patient class into which said individual is classified; and (c) 
classifying said individual as having a good prognosis or a poor prognosis on the basis of said 
levels of expression. In a specific embodiment, said condition is breast cancer, said good 
prognosis is the non-occurrence of metastases within five years of initial diagnosis, and said 
poor prognosis is the occurrence of metastases within five years of initial diagnosis. In an 
more specific embodiment, said classifying said individual with a condition as having a good 
prognosis or a poor prognosis is carried out by comparing the patient's expression profile of 
said plurality of genes or their encoded proteins to a good and/or poor prognosis template 
profile of expression levels of said plurality of genes or their encoded proteins, and 
classifying said individual as having a good prognosis or poor prognosis if said patient 
expression profile has a high similarity to a good prognosis template or a poor prognosis 
template, respectively. In a more specific embodiment of the method, said plurality of patient 
subsets comprises ER~ BRCA1 individuals; ER~~, sporadic individuals; ER+, ER/AGE high 
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individuals; ER+, ER/AGE low, LN+ individuals; and ER+, ER/AGE low, LINT individuals. 
In another embodiment, said good prognosis template comprises the average level of 
expression of each of said plurality of genes informative for prognosis in tumor samples from 
individuals classified into said subset who have a good prognosis or good outcome, while 
said poor prognosis template comprises the average level of expression of each of said 
plurality of genes informative for prognosis in tumor samples from individuals classified into 
said subset who have a poor prognosis or poor outcome. In another specific embodiment, 
said good or poor prognosis template is a set of mathematical values representing the average 
level of expression of genes informative for prognosis in tumor samples of individuals 
classifiable into said subset who have a good prognosis, or who have a poor prognosis, 
respectively. 

[00123] It is evident that the different patient subsets described herein reflect different 
molecular mechanisms of the initiation of tumor formation and metastasis. Thus, the 
genesets listed in tables 1-5 are also useful for diagnosing a person as having a particular type 
of breast cancer in the first instance. Thus, the invention also provides a method of 
diagnosing an individual as having a particular subtype of breast cancer, comprising 
determining an expression profile of a plurality of the genes for which markers are listed in 
Tables 1-5 in a sample from said.individual; and comparing said expression profile to a 
template profile, where said template is representative of the expression of said plurality of 
genes in a breast cancer sample of said subtype of cancer, and on the basis of said 
comparison, diagnosing the individual as having said subtype of breast cancer. In a specific 
embodiment, said subtype of cancer is selected from the group consisting of ER~ BRCA1 
type; ER~ sporadic type; ER+, ER/AGE high type; ER+, ER/AGE low, LN+ type; and 
ER/AGE low, LN~ type. In another specific embodiment, said template comprises the 
average levels of expression of a plurality of the genes for which markers are listed in Table 
1, Table 2, Table 3, Table 4 or Table 5. In another specific example, said comparing 
comprises determining the similarity of the expression profile of the genes for which markers 
are listed in each of Tables 1-5 in said sample taken from said individual to a template profile 
comprising levels of expression of the same genes for each of Tables 1-5, and determining 
whether the pattern of expression of said genes in said sample is most similar to the pattern of 
expression of the genes for which markers are listed in Table 1, Table 2, Table 3, Table 4 or 
Table 5. 

[00124] In another embodiment, the invention provides a method of classifying an individual 
as having a good prognosis or a poor prognosis, comprising: (a) classifying said individual as 
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ER~, BRCA1; ER~ sporadic; ER+, ER/AGE high; ER+, ER/AGE low, LN+; or ER+, 
ER/AGE low, LN~; (b) determining an expression profile of a first plurality of genes in a cell 
sample taken from the individual relative to a control, said first plurality of genes comprising 
at least two of the genes corresponding to the markers Table 1 if said individual is classified 
as ER~~, BRCA1; Table 2 if said individual is classified as ER~, sporadic; Table 3 if said 
individual is classified as ER+, ER/AGE high; Table 4 if said individual is classified as ER+, 
ER/AGE low, LN+; or Table 5 if said individual is classified as ER+, ER/AGE low, LN~, 
wherein said individual is "ER/AGE high" if the ER level of the individual exceeds a 
predetermined value, and "ER/AGE low" if the ER level of the individual does not exceed 
said predetermined value. In a specific embodiment of this method, said predetermined value 
of ER is calculated as ER = 0.1 (AGE - 42.5), wherein AGE is the age of said individual. In 
another specific embodiment, said individual is ER~ BRCA1, and said plurality of genes 
comprises at least 1, 2, 3, 4, 5, 10 or all of the genes for which markers are listed in Table 1. 
In another specific embodiment, said individual is ER - , sporadic, and said plurality of genes 
comprises at least 1, 2, 3, 4, 5, 10 or all of the genes for which markers are listed in Table 2. 
In another specific embodiment, said individual is ER+, ER/AGE high, and said plurality of 
genes comprises at least 1, 2, 3, 4, 5, 10 or all of the genes for which markers are listed in 
Table 3. In another specific embodiment, said individual is ER+, ER/AGE low, LN+, and 
said plurality of genes comprises at least 1, 2, 3, 4, 5, 10 or all of the genes for which markers 
are listed in Table 4. In another specific embodiment, said individual is ER+, ER/AGE low, 
LN~ and said plurality of genes comprises at least 1, 2, 3, 4, 5, 10 or all of the genes for 
which markers are listed in Table 5. In another specific embodiment, the method additionally 
comprises determining in said cell sample the level of expression, relative to a control, of a 
second plurality of genes for which markers are not found in Tables 1-5, wherein said second 
plurality of genes is informative for prognosis. 

[00125] Where information is available regarding the LN status of a breast cancer patient, the 
patient may be identified as having a "very good prognosis," an "intermediate prognosis," or 
a poor prognosis, which enables the refinement of treatment. In one embodiment, the 
invention provides a method of assigning a therapeutic regimen to a breast cancer patient, 
comprising: (a) classifying said patient as having a "poor prognosis," "intermediate 
prognosis," or "very good prognosis" on the basis of the levels of expression of at least five 
genes for which markers are listed in Table 1, Table 2, Table 3, Table 4 or Table 5; and (b) 
assigning said patient a therapeutic regimen, said therapeutic regimen (i) comprising no 
adjuvant chemotherapy if the patient is lymph node negative and is classified as having a 
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good prognosis or an intermediate prognosis, or (ii) comprising chemotherapy if said patient 
has any other combination of lymph node status and expression profile. 
[00126] In another embodiment, a breast cancer patient is assigned a prognosis by a method 
comprising (a) determining the breast cancer patient's age, ER status, LN status and tumor 
type; (b) classifying said patient as ER - , sporadic; ER - BRCA1; ER+, ER/AGE high; ER+, 
ER/AGE low, LN+; or ER+, ER/AGE low, LN - ; (c) determining an expression profile 
comprising at least five genes in a cell sample taken from said breast cancer patient wherein 
markers for said at least five genes are listed in Table 1 if said patient is classified as ER - , 
sporadic; Table 2 if said patient is classified as ER - , BRCA1; Table 3 if said patient is 
classified as ER+, ER/AGE high; Table 4 if said patient is classified as ER+, ER/AGE low, 
LN+; or Table 5 if said patient is classified as ER+, ER/AGE high, LN - ; (d) determining the 
similarity of the expression profile of said at least five genes to a template profile comprising 
levels of expression of said at least five genes to obtain a patient similarity value; (e) 
comparing said patient similarity value to selected first and second threshold values of 
similarity, respectively, wherein said second similarity threshold indicates greater similarity 
to said template expression profile than does said first similarity threshold; and (f) classifying 
said breast cancer patient as having a first prognosis if said patient similarity value exceeds 
said second threshold similarity values, a second prognosis if said patient similarity value 
exceeds said first threshold similarity value but does not exceed said second threshold 
similarity value, and a third prognosis if said patient similarity value does not exceed said 
first threshold similarity value. In a specific embodiment of the method, said first prognosis 
is a "very good prognosis," said second prognosis is an "intermediate prognosis," and said 
third prognosis is a "poor prognosis," wherein said breast cancer patient is assigned a 
therapeutic regimen comprising no adjuvant chemotherapy if the patient is lymph node 
negative and is classified as having a good prognosis or an intermediate prognosis, or 
comprising chemotherapy if said patient has any other combination of lymph node status and 
expression profile. 

[00127] The invention also provides a method of assigning a therapeutic regimen to a breast 
cancer patient, comprising: (a) determining the lymph node status for said patient; (b) 
determining the expression of at least five genes for which markers are listed in Table 5 in a 
cell sample from said patient, thereby generating an expression profile; (c) classifying said 
patient as having a "poor prognosis," "intermediate prognosis," or "very good prognosis" on 
the basis of said expression profile; and (d) assigning said patient a therapeutic regimen, said 
therapeutic regimen comprising no adjuvant chemotherapy if the patient is lymph node 
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negative and is classified as having a good prognosis or an intermediate prognosis, or 
comprising chemotherapy if said patient has any other combination of lymph node status and 
classification. In a specific embodiment of this method, said therapeutic regimen assigned to 
lymph node negative patients classified as having an "intermediate prognosis" additionally 
comprises adjuvant hormonal therapy. In another specific embodiment of this method, said 
classifying step (c) is carried out by a method comprising: (a) rank ordering in descending 
order a plurality of breast cancer tumor samples that compose a pool of breast cancer tumor 
samples by the degree of similarity between the expression profile of said at least five genes 
in each of said tumor samples and the expression profile of said at least five genes across all 
remaining tumor samples that compose said pool, said degree of similarity being expressed as 
a similarity value; (b) determining an acceptable number of false negatives in said classifying 
step, wherein a false negative is a breast cancer patient for whom the expression levels of said 
at least five genes in said cell sample predicts that said breast cancer patient will have no 
distant metastases within the first five years after initial diagnosis, but who has had a distant 
metastasis within the first five years after initial diagnosis; (c) determining a similarity value 
above which in said rank ordered list said acceptable number of tumor samples or fewer are 
false negatives; (d) selecting said similarity value determined in step (c) as a first threshold 
similarity value; (e) selecting a second similarity value, greater than said first similarity 
value, as a second threshold similarity value; and (f) determining the similarity between the 
expression profile of said at least five genes in a breast cancer tumor sample from the breast 
cancer patient and the expression profile of said respective at least five genes in said pool, to 
obtain a patient similarity value, wherein if said patient similarity value equals or exceeds 
said second threshold similarity value, said patient is classified as having a "very good 
prognosis"; if said patient similarity value equals or exceeds said first threshold similarity 
value, but is less than said second threshold similarity value, said patient is classified as 
having an "intermediate prognosis"; and if said patient similarity value is less than said first 
threshold similarity value, said patient is classified as having a "poor prognosis." Another 
specific embodiment of this method comprises determining the estrogen receptor (ER) status 
of said patient, wherein if said patient is ER positive and lymph node negative, said 
therapeutic regimen assigned to said patient additionally comprises adjuvant hormonal 
therapy. 

[00128] A patient in any patient subset or clinical class, e.g., any one of the classes described 
above, can be classified as having a particular prognosis level, e.g., a good prognosis or a 
poor prognosis, based on the similarity of the patient's cellular constituent profile to an 
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appropriate template profile for the prognosis level of patients in the clinical class. In one 
embodiment, a cellular constituent profile corresponding to a certain prognosis level, e.g., a 
profile comprising measurements of the plurality of cellular constituents representative of 
levels of the cellular constituents in a plurality of patients having the prognosis level is used 
as a template for the prognosis level. For example, a good prognosis template profile 
comprising measurements of the plurality of cellular constituents representative of levels of 
the cellular constituents in a plurality of good outcome patients or a poor prognosis template 
profile comprising measurements of the plurality of cellular constituents representative of 
levels of the cellular constituents in a plurality of poor outcome patients, can be used for 
determining whether a patient have good or poor prognosis. Here, a good outcome patient fe 
a patient who has non-reoccurrence of metastases within a period of time after initial 
diagnosis, e.g., a period of 1, 2, 3, 4, 5 or 10 years. In contrast, a poor outcome patient is a 
patient who has reoccurrence of metastases within a period of time after initial diagnosis, e.g., 
a period of 1, 2, 3, 4, 5 or 10 years. In a preferred embodiment, both periods are 10 years. 
Tables 1-5 show exemplary template profiles for the respective patient classes. For example, 
the expression profile of a patient with a combination of ER+, ER/AGE low, LN+ can be 
compared with the good prognosis template of Table 4 to determine if the patient has good 
prognosis or poor prognosis. 

[00129] The degree of similarity of the patient's cellular constituent profile to a template of a 
particular prognosis can be used to indicate whether the patient has the particular prognosis. 
For example, a high degree of similarity indicates that the patient has the particular prognosis, 
whereas a low degree of similarity indicates that the patient does not have the particular 
prognosis. In a preferred embodiment, a patient is classified as having a good prognosis 
profile if the patient's cellular constituent profile has a high similarity to a good prognosis 
template and/or has a low similarity to a poor prognosis template. In another embodiment, a 
patient is classified as having a poor prognosis profile if the patient's cellular constituent 
profile has a low similarity to a good prognosis template and/or has a high similarity to a poor 
prognosis template. In embodiments for predicting the responsiveness of a breast cancer 
patient under the age of 55, the patients in the good and poor outcome patient populations 
used to generate the templates are preferably also under the age of 55 at the time of diagnosis 
of breast cancer. 

[00130] The degree of similarity between a patient's cellular constituent profile and a 
template profile can be determined using any method known in the art. In one embodiment, 
the similarity is represented by a correlation coefficient between the patient's profile and the 
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template. In one embodiment, a correlation coefficient above a correlation threshold 
indicates high similarity, whereas a correlation coefficient below the threshold indicates low 
similarity. In preferred embodiments, the correlation threshold is set as 0.3, 0.4, 0.5 or 0.6. 
In another embodiment, similarity between a patient's profile and a template is represented 
by a distance between the patient's profile and the template. In one embodiment, a distance 
below a given value indicates high similarity, whereas a distance equal to or greater than the 
given value indicates low similarity. 

[0013 1] As an illustration, in one embodiment, a template for a good prognosis is defined as 
z x (e.g., a profile consisting of the xdev's listed in the good prognosis column of one of 
Tables 1-5) and/or a template for poor prognosis is defined as z 2 (e.g., a profile consisting of 
the xdev's listed in the poor prognosis column of one of Tables 1-5). Either one or both of 
the two classifier parameters (P x and P 2 ) can then be used to measure degrees of similarities 
between a patient's profile and the respective templates: P x measures the similarity between 
the patient's profile y and the good prognosis template z x , and P 2 measures the similarity 
between y and the poor prognosis template z 2 . In embodiments which employ correlation 
coefficients, the correlation coefficient P t can be calculated as: 

p^^^nm^ ( 4) 

where i = 1 and 2. 

[00132] Thus, in one embodiment, y is classified as a good prognosis profile if P x is greater 
than a selected correlation threshold or if P 2 is equal to or less than a selected correlation 
threshold. In another embodiment, y is classified as a poor prognosis profile if P x is less 
than a selected correlation threshold or if P 2 is above a selected correlation threshold. In still 
another embodiment, y is classified as a good prognosis profile if P x is greater than a first 
selected correlation threshold and y is classified as a poor prognosis profile if P 2 is greater 
than a second selected correlation threshold. 

[00133] In a preferred embodiment, the cellular constituent profile is an expression profile 
comprising measurements of a plurality of transcripts (e.g., measured as mRNAs or cDNAs) 
in a sample derived from a patient, e.g., the plurality of transcripts corresponding to the 
markers in all or a portion of one of Tables 1-5. In this embodiment, the good prognosis 
template can be a good prognosis expression template comprising measurements of the 
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plurality of transcripts representative of expression levels of the transcripts in a plurality of 
good prognosis patients, and the poor prognosis template can be a poor prognosis expression 
template comprising measurements of the plurality of transcripts representative of expression 
levels of the transcripts in a plurality of poor prognosis patients. In a preferred embodiment, 
measurement of each transcript in the good or poor prognosis expression template is an 
average of expression levels of the transcript in the plurality of good or poor prognosis 
patients, respectively. 

[00134] In another embodiment, the expression profile is a differential expression profile 
comprising differential measurements of the plurality of transcripts in a sample derived from 
the patient versus measurements of the plurality of transcripts in a control sample. The 
differential measurements can be xdev, log(ratio), error- weighted log(ratio), or a mean 
subtracted log(intensity) (see, e.g., Stoughton et al., PCT publication WO 00/39339, 
published on July 6, 2000; U.S. Patent Application No. 10/848,755, filed May 18, 2004, by 
Mao et al., attorney docket no: 9301-188-999, each of which is incorporated herein by 
reference in its entirety). 

5.4.3 IMPROVING SENSITIVITY TO EXPRESSION LEVEL 
DIFFERENCES 

[00135] In using the markers disclosed herein, and, indeed, using any sets of markers, e.g., to 
compare profiles or to differentiate an individual having one phenotype from another 
individual having a second phenotype, one can compare the profile comprising absolute 
expression levels of the markers in a sample to a template; for example, a template 
comprising the average levels of expression of the markers in a plurality of individuals. To 
increase the sensitivity of the comparison, however, the expression level values are preferably 
transformed in a number of ways. Also, to differentiate an individual having one phenotype 
from another individual having a second phenotype using any sets of markers, one can 
compare the absolute expression of each of the markers in a sample to a control; for example, 
the control can be the average level of expression of each of the markers, respectively, in a 
pool of individuals. 

[00136] For example, the expression level of each of the markers can be normalized by the 
average expression level of all markers the expression level of which is determined, or by the 
average expression level of a set of control genes. Thus, in one embodiment, the markers are 
represented by probes on a microarray, and the expression level of each of the markers is 
normalized by the mean or median expression level across all of the genes represented on the 
microarray, including any non-marker genes. In a specific embodiment, the normalization is 
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carried out by dividing the median or mean level of expression of all of the genes on the 
microarray. In another embodiment, the expression levels of the markers is normalized by 
the mean or median level of expression of a set of control markers. In a specific 
embodiment, the control markers comprise a set of housekeeping genes. In another specific 
embodiment, the normalization is accomplished by dividing by the median or mean 
expression level of the control genes. 

[00137] The sensitivity of a marker-based assay will also be increased if the expression 
levels of individual markers are compared to the expression of the same markers in a pool of 
samples. Preferably, the comparison is to the mean or median expression level of each the 
marker genes in the pool of samples. Such a comparison may be accomplished, for example, 
by dividing by the mean or median expression level of the pool for each of the markers from 
the expression level each of the markers in the sample. This has the effect of accentuating the 
relative differences in expression between markers in the sample and markers in the pool as a 
whole, making comparisons more sensitive and more likely to produce meaningful results 
that the use of absolute expression levels alone. The expression level data may be 
transformed in any convenient way; preferably, the expression level data for all is log 
transformed before means or medians are taken. 

[00138] In performing comparisons to a pool, two approaches may be used. First, the 
expression levels of the markers in the sample may be compared to the expression level of 
those markers in the pool, where nucleic acid derived from the sample and nucleic acid 
derived from the pool are hybridized during the course of a single experiment. Such an 
approach requires that new pool nucleic acid be generated for each comparison or limited 
numbers of comparisons, and is therefore limited by the amount of nucleic acid available. 
Alternatively, and preferably, the expression levels in a pool, whether normalized and/or 
transformed or not, are stored on a computer, or on computer-readable media, to be used in 
comparisons to the individual expression level data from the sample (i.e., single-channel 
data). 

[00139] The current invention also provides the following method of classifying a first cell or 
organism as having one of at least two different phenotypes, where the different phenotypes 
comprise a first phenotype and a second phenotype. The level of expression of each of a 
plurality of markers in a first sample from the first cell or organism is compared to the level 
of expression of each of said markers, respectively, in a pooled sample from a plurality of 
cells or organisms, the plurality of cells or organisms comprising different cells or organisms 
exhibiting said at least two different phenotypes, respectively, to produce a first compared 
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value. The first compared value is then compared to a second compared value, wherein said 
second compared value is the product of a method comprising comparing the level of 
expression of each of said markers in a sample from a cell or organism characterized as 
having said first phenotype to the level of expression of each of said markers, respectively, in 
the pooled sample. The first compared value is then compared to a third compared value, 
wherein said third compared value is the product of a method comprising comparing the level 
of expression of each of the markers in a sample from a cell or organism characterized as 
having the second phenotype to the level of expression of each of the markers, respectively, 
in the pooled sample. In specific embodiments, the marker can be a gene, a protein encoded 
by the gene, etc. Optionally, the first compared value can be compared to additional 
compared values, respectively, where each additional compared value is the product of a 
method comprising comparing the level of expression of each of said markers in a sample 
from a cell or organism characterized as having a phenotype different from said first and 
second phenotypes but included among the at least two different phenotypes, to the level of 
expression of each of said genes, respectively, in said pooled sample. Finally, a 
determination is made as to which of said second, third, and, if present, one or more 
additional compared values, said first compared value is most similar, wherein the first cell or 
organism is determined to have the phenotype of the cell or organism used to produce said 
compared value most similar to said first compared value. 

[00140] The sensitivity of a marker-based assay will also be increased if the expression 
levels of individual markers are compared to the expression of the same markers in a control 
sample, e.g., a sample comprises a pool of samples, to generate a differential expression 
profile. Such a comparison may be accomplished, for example, by determining a ratio 
between expression level of each marker in the sample and the expression level of the 
corresponding marker in the control sample. This has the effect of accentuating the relative 
differences in expression between markers in the sample and markers in the control as a 
whole, making subsequent comparisons to a template more sensitive and more likely to 
produce meaningful results than the use of absolute expression levels alone. The comparison 
may be performed in any convenient way, e.g., by taking difference, ratio, or log(ratio). 
[00141] In performing comparisons to a control sample, two approaches may be used. First, 
the expression levels of the markers in the sample may be compared to the expression level of 
those markers in the control sample, where nucleic acid derived from the sample and nucleic 
acid derived from the control are hybridized during the course of a single experiment. Such 
an approach requires that new control sample of nucleic acid be generated for each 
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comparison or limited numbers of comparisons, and is therefore limited by the amount of 
nucleic acid available. Alternatively, the expression levels in a control sample, whether 
normalized and/or transformed or not, are stored on a computer, or on computer-readable 
media, to be used in comparisons to the individual expression level data from the sample (i.e., 
single-channel data). 

[00142] The methods of the invention preferably use a control or reference sample, which 
can be any suitable sample against which changes in cellular constituents can be determined. 
In one embodiment, the control or reference sample is generated by pooling together the 
plurality of cellular constituents, e.g., a plurality of transcripts or cDNAs, or a plurality of 
protein species, from a plurality of breast cancer patients. Alternatively, the control or 
reference sample can be generated by pooling together purified or synthesized cellular 
constituents, e.g., a plurality of purified or synthesized transcripts or cDNAs, a plurality of 
purified or synthesized protein species. In one embodiment, synthetic RNAs for each 
transcripts or cDNAs are pooled to form the control or reference sample. Preferably, the 
abundances of synthetic RNAs are approximately the abundances of the corresponding 
transcripts in a real tumor pool. The differential expression of marker genes for each 
individual patient sample is measured against this control sample. In one embodiment, 60- 
mer oligonucleotides corresponding to the probe sequences on a microarray used to assay the 
expression levels of the diagnostic/prognostic transcripts are synthesized and cloned into 
pBluescript SK- vector (Statagene, La Jolla, CA), adjacent to the T7 promotor sequence. 
Individual clones are isolated, and the sequences of their inserts are verified by DNA 
sequencing. To generate synthetic RNAs, clones are linearized with EcoRI and a T7 in vitro 
transcription (IVT) reaction is performed by MegaScript kit (Ambion, Austin, TX), followed 
by DNase treatment of the product. Synthetic RNAs are purified on RNeasy columns 
(Qiagen, Valencia, CA). These synthetic RNAs are transcribed, amplified, labeled, and 
mixed together to make the reference pool. The abundance of those synthetic RNAs are 
chosen to approximate the abundances of the transcripts of the corresponding marker genes in 
the real tumor pool. 

[00143] The current invention provides the following method of classifying a first cell or 
organism as having one of at least two different phenotypes, where the different phenotypes 
comprise a first phenotype and a second phenotype. The level of expression of each of a 
plurality of markers in a first sample from the first cell or organism is compared to the level 
of expression of each of said markers, respectively, in a pooled sample from a plurality of 
cells or organisms, the plurality of cells or organisms comprising different cells or organisms 
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exhibiting said at least two different phenotypes, respectively, to produce a first compared 
value so that a first differential profile comprising a plurality of first compared values for said 
plurality of markers is generated. The first differential profile is then compared to a second 
differential profile comprising a plurality of second compared values, wherein each said 
second compared value is the product of a method comprising comparing the level of 
expression of each of said markers in a sample from a cell or organism characterized as 
having said first phenotype to the level of expression of each of said markers, respectively, in 
the pooled sample. The first differential profile is then compared to a third differential profile 
comprising a plurality of third compared values, wherein each said third compared value is 
the product of a method comprising comparing the level of expression of each of the markers 
in a sample from a cell or organism characterized as having the second phenotype to the level 
of expression of each of the markers, respectively, in the pooled sample. In specific 
embodiments, each marker can be a gene, a protein encoded by the gene, etc. Optionally, the 
first differential profile can be compared to additional expression profiles each of which 
comprising additional compared values, respectively, where each additional compared value 
is the product of a method comprising comparing the level of expression of each of said 
markers in a sample from a cell or organism characterized as having a phenotype different 
from said first and second phenotypes but included among the at least two different 
phenotypes, to the level of expression of each of said genes, respectively, in said pooled 
sample. Finally, a determination is made as to which of said second, third, and, if present, 
one or more additional differential profiles, said first differential profile is most similar, 
wherein the first cell or organism is determined to have the phenotype of the cell or organism 
used to produce said differential profile most similar to said first differential profile. 
[00144] In a specific embodiment of this method, the compared values are each ratios of the 
levels of expression of each of said genes. In another specific embodiment, each of the levels 
of expression of each of the genes in the pooled sample are normalized prior to any of the 
comparing steps. In a more specific embodiment, the normalization of the levels of 
expression is carried out by dividing by the median or mean level of the expression of each of 
the genes or dividing by the mean or median level of expression of one or more housekeeping 
genes in the pooled sample from said cell or organism. In another specific embodiment, the 
normalized levels of expression are subjected to a log transform, and the comparing steps 
comprise subtracting the log transform from the log of the levels of expression of each of the 
genes in the sample. In another specific embodiment, the two or more different phenotypes 
are different stages of a disease or disorder. In still another specific embodiment, the two or 
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more different phenotypes are different prognoses of a disease or disorder. In yet another 
specific embodiment, the levels of expression of each of the genes, respectively, in the pooled 
sample or said levels of expression of each of said genes in a sample from the cell or 
organism characterized as having the first phenotype, second phenotype, or said phenotype 
different from said first and second phenotypes, respectively, are stored on a computer or on 
a computer-readable medium. 

[00145] In another specific embodiment, the two phenotypes are good prognosis and poor 
prognosis. In a more specific embodiment, the two phenotypes are good prognosis and poor 
prognosis for an individual that is identified as having ER~, BRCA1 status, ER~~, sporadic 
status, ER+, ER/AGE high status, ER+, ER/AGE low, LN+ status, or ER+, ER/AGE low, 
LN+ status. 

[00146] In another specific embodiment, the comparison is made between the expression 
profile of the genes in the sample and the expression profile of the same genes in a pool 
representing only one of two or more phenotypes. In the context of prognosis-correlated 
genes, for example, one can compare the expression levels of prognosis-related genes in a 
sample to the average levels of the expression of the same genes in a plurality of "good 
prognosis" samples (as opposed to a plurality of samples that include samples from patients 
having poor prognoses and good prognoses). Thus, in this method, a sample is classified as 
having a good prognosis if the expression profile of prognosis-correlated genes exceeds a 
chosen coefficient of correlation to the average "good prognosis" expression profile (e.g., the 
profile comprising average levels of expression of prognosis-correlated genes in samples 
from a plurality of patients having a "good prognosis"). Patients whose expression profiles 
correlate more poorly with the "good prognosis" expression profile (e.g., whose correlation 
coefficient fails to exceed the chosen coefficient) are classified as having a poor prognosis. 
[00147] Where individuals are classified on the basis of phenotypic, genotypic, or clinical 
characteristics into patient subsets, the pool of samples may be a pool of samples for the 
phenotype that includes samples representing each of the patient subsets. Alternatively, the 
pool of samples may be a pool of samples for the phenotype representing only the specific 
patient subset. For example, where an individual is classified as ER+, sporadic, the pool of 
samples to which the individual's sample is compared may be a pool of samples from ER+, 
sporadic individuals having a good prognosis only, or may be a pool of samples of 
individuals having a good prognosis, without regard to ER status or mutation type. 
[00148] The method can be applied to a plurality of patient subsets. For example, in a 
specific embodiment, the phenotype is good prognosis, and the individual is classified into 
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one of the following patient subsets: ER~, BRCA1 status, ER~ sporadic status, ER+, 
ER/AGE high status, ER+, ER/AGE low, LN+ status, or ER+, ER/AGE low, LN+ status. A 
set of markers informative for prognosis for the patient subset into which the individual is 
classified is then used to determine the likely prognosis for the individual. A sample is 
classified as coming from an individual having a good prognosis if the expression profile of 
prognosis-correlated genes for the particular subset into which the individual is classified 
exceeds a chosen coefficient of correlation to the average "good prognosis" expression 
profile {e.g., the levels of expression of prognosis-correlated genes in a plurality of samples 
from patients within the subclass having a "good prognosis"). Patients whose expression 
levels correlate more poorly with the "good prognosis" expression profile {e.g., whose 
correlation coefficient fails to exceed the chosen coefficient) are classified as having a poor 
prognosis. 

[00149] Of course, single-channel data may also be used without specific comparison to a 
mathematical sample pool. For example, a sample may be classified as having a first or a 
second phenotype, wherein the first and second phenotypes are related, by calculating the 
similarity between the expression profile of at least 5 markers in the sample, where the 
markers are correlated with the first or second phenotype, to a first phenotype template and a 
second phenotype template each comprising the expression levels of the same markers, by (a) 
labeling nucleic acids derived from a sample with a fluorophore to obtain a pool of 
fluorophore-labeled nucleic acids; (b) contacting said fluorophore-labeled nucleic acid with a 
microarray under conditions such that hybridization can occur, detecting at each of a plurality 
of discrete loci on the microarray a fluorescent emission signal from said fluorophore-labeled 
nucleic acid that is bound to said microarray under said conditions; and (c) determining the 
similarity of marker gene expression in the individual sample to the first and second 
templates, wherein if said expression is more similar to the first template, the sample is 
classified as having the first phenotype, and if said expression is more similar to the second 
template, the sample is classified as having the second phenotype. 
[0100] In a specific embodiment of the above method, the first phenotype is a good 
prognosis of breast cancer, the sample is a sample from an individual that has been classified 
into a patient subset, and the first and second templates are templates for the phenotype for 
the particular patient subset. In a more specific embodiment, for example, the first phenotype 
is a good prognosis, the second phenotype is a poor prognosis, the patient is classified into an 
ER~~, sporadic patient subset, an ER~ BRCA1 subset, an ER+, ER/AGE high subset, an ER+, 
ER/AGE low, LN+ subset, or an ER+, ER/AGE low, LN+ subset, and said first and second 
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templates are templates derived from the expression of the marker genes in individuals 
having a good prognosis and a poor prognosis, respectively, wherein said individuals are all 
of the patient subset into which said patient is classified. 

5.5 DETERMINATION OF MARKER GENE EXPRESSION LEVELS 
5.5.1 METHODS 

[00150] The expression levels of the marker genes in a sample may be determined by any 
means known in the art. The expression level may be determined by isolating and 
determining the level (i.e., amount) of nucleic acid transcribed from each marker gene. 
Alternatively, or additionally, the level of specific proteins encoded by a marker gene may be 
determined. 

[00151] The level of expression of specific marker genes can be accomplished by 
determining the amount of mRNA, or polynucleotides derived therefrom, present in a sample. 
Any method for determining RNA levels can be used. For example, RNA is isolated from a 
sample and separated on an agarose gel. The separated RNA is then transferred to a solid 
support, such as a filter. Nucleic acid probes representing one or more markers are then 
hybridized to the filter by northern hybridization, and the amount of marker-derived RNA is 
determined. Such determination can be visual, or machine-aided, for example, by use of a 
densitometer. Another method of determining RNA levels is by use of a dot-blot or a slot- 
blot. In this method, RNA, or nucleic acid derived therefrom, from a sample is labeled. The 
RNA or nucleic acid derived therefrom is then hybridized to a filter containing 
oligonucleotides derived from one or more marker genes, wherein the oligonucleotides are 
placed upon the filter at discrete, easily-identifiable locations. Hybridization, or lack thereof, 
of the labeled RNA to the filter-bound oligonucleotides is determined visually or by 
densitometer. Polynucleotides can be labeled using a radiolabel or a fluorescent (i.e., visible) 
label. 

[00152] These examples are not intended to be limiting; other methods of determining RNA 
abundance are known in the art. 

[00153] The level of expression of particular marker genes may also be assessed by 
determining the level of the specific protein expressed from the marker genes. This can be 
accomplished, for example, by separation of proteins from a sample on a polyacrylamide gel, 
followed by identification of specific marker-derived proteins using antibodies in a western 
blot. Alternatively, proteins can be separated by two-dimensional gel electrophoresis 
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systems. Two-dimensional gel electrophoresis is well-known in the art and typically involves 
isoelectric focusing along a first dimension followed by SDS-PAGE electrophoresis along a 
second dimension. See, e.g., Hames et al, 1990, Gel Electrophoresis of Proteins: A 
Practical Approach, IRL Press, New York; Shevchenko et al, Proc. Natl Acad. Set USA 
93:1440-1445 (1996); Sagliocco et al, Yeast 12:1519-1533 (1996); Lander, Science 
274:536-539 (1996). The resulting electropherograms can be analyzed by numerous 
techniques, including mass spectrometric techniques, western blotting and immunoblot 
analysis using polyclonal and monoclonal antibodies. 

[00154] Alternatively, marker-derived protein levels can be determined by constructing an 
antibody microarray in which binding sites comprise immobilized, preferably monoclonal, 
antibodies specific to a plurality of protein species encoded by the cell genome. Preferably, 
antibodies are present for a substantial fraction of the marker-derived proteins of interest. 
Methods for making monoclonal antibodies are well known {see, e.g., Harlow and Lane, 
1988, Antibodies: A Laboratory Manual, Cold Spring Harbor, New York, which is 
incorporated in its entirety for all purposes). In one embodiment, monoclonal antibodies are 
raised against synthetic peptide fragments designed based on genomic sequence of the cell. 
With such an antibody array, proteins from the cell are contacted to the array, and then- 
binding is assayed with assays known in the art. Generally, the expression, and the level of 
expression, of proteins of diagnostic or prognostic interest can be detected through 
immunohistochemical staining of tissue slices or sections. 

[00155] Finally, expression of marker genes in a number of tissue specimens may be 
characterized using a "tissue array" (Kononen et al. , Nat. Med 4(7):844-7 (1 998)). In a tissue 
array, multiple tissue samples are assessed on the same microarray. The arrays allow in situ 
detection of RNA and protein levels; consecutive sections allow the analysis of multiple 
samples simultaneously. 

5.5.2 MICROAREAYS 
[00156] In preferred embodiments, polynucleotide microarrays are used to measure 
expression so that the expression status of each of the markers above is assessed 
simultaneously. Generally, microarrays according to the invention comprise a plurality of 
markers informative for prognosis, or outcome determination, for a particular disease or 
condition, and, in particular, for individuals having specific combinations of genotypic or 
phenotypic characteristics of the disease or condition (i.e., that are prognosis-informative for 
a particular patient subset). 
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[00157] The microarrays of the invention preferably comprise at least 2, 3, 4, 5, 7, 10, 15, 20, 
25, 30, 35, 40, 45, 50, 75, 100, 150, 200 or more of markers, or all of the markers, or any 
combination of markers, identified as prognosis-informative within a patient subset. The 
actual number of informative markers the microarray comprises will vary depending upon the 
particular condition of interest, the number of markers identified, and, optionally, the number 
of informative markers found to result in the least Type I error, Type II error, or Type I and 
Type II error in determination of prognosis. As used herein, "Type I error" means a false 
positive and "Type II error" means a false negative; in the example of prognosis of beast 
cancer, Type I error is the mis-characterization of an individual with a good prognosis as 
having a poor prognosis, and Type II error is the mis-characterization of an individual with a 
poor prognosis as having a good prognosis. 

[00158] In specific embodiments, the invention provides polynucleotide arrays in which the 
prognosis markers identified for a particular patient subset comprise at least 50%, 60%, 70%, 
80%, 85%, 90%, 95% or 98% of the probes on said array. In another specific embodiment, 
the microarray comprises a plurality of probes, wherein said plurality of probes comprise 
probes complementary and hybridizable to at least 75% of the prognosis-informative markers 
identified for a particular patient subset. Microarrays of the invention, of course, may 
comprise probes complementary and hybridizable to prognosis-informative markers for a 
plurality of the patient subsets, or for each patient subset, identified for a particular condition. 
In another embodiment, therefore, the microarray of the invention comprises a plurality of 
probes complementary and hybridizable to at least 75% of the prognosis-informative markers 
identified for each patient subset identified for the condition of interest, and wherein said 
probes, in total, are at least 50% of the probes on said microarray. 
[00159] In yet another specific embodiment, microarrays that are used in the methods 
disclosed herein optionally comprise markers additional to at least some of the markers 
identified by the methods disclosed elsewhere herein. For example, in a specific 
embodiment, the microarray is a screening or scanning array as described in Altschuler et al. 9 
International Publication WO 02/18646, published March 7, 2002 and Scherer et al, 
International Publication WO 02/16650, published February 28, 2002. The scanning and 
screening arrays comprise regularly-spaced, positionally-addressable probes derived from 
genomic nucleic acid sequence, both expressed and unexpressed. Such arrays may comprise 
probes corresponding to a subset of, or all of, the markers identified for the patient subset(s) 
for the condition of interest, and can be used to monitor marker expression in the same way 
as a microarray containing only prognosis-informative markers otherwise identified. 
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[00160] In yet another specific embodiment, the microarray is a commercially-available 
cDNA microarray that comprises at least five markers identified by the methods described 
herein. Preferably, a commercially-available cDNA microarray comprises all of the markers 
identified by the methods described herein as being informative for a patient subset for a 
particular condition. However, such a microarray may comprise at least 5, 10, 15 or 25 of 
such markers, up to the maximum number of markers identified. 
[00161] In an embodiment specific to breast cancer, the invention provides for 
oligonucleotide or cDNA arrays comprising probes hybridizable to the genes corresponding 
to each of the marker sets described above (i.e., markers informative for ER", sporadic 
individuals, markers informative for ER~ BRCA1 individuals, markers informative for ER+, 
ER/AGE high individuals, markers informative for ER+, ER/AGE low, LN+ individuals, and 
markers infomiative for ER+, ER/AGE low, LN~~ individuals, as shown in Tables 1-5). Any 
of the microarrays described herein may be provided in a sealed container in a kit. 
[00162] The invention provides microarrays containing probes useful for the prognosis of 
any breast cancer patient, or for breast cancer patients classified into one of a plurality of 
patient subsets. In particular, the invention provides polynucleotide arrays comprising probes 
to a subset or subsets of at least 5, 10, 1 5, 20, 25 or more of the genetic markers, or up to the 
full set of markers, in any of Tables 1-5, which distinguish between patients with good and 
poor prognosis. In certain embodiments, therefore, the invention provides microarrays 
comprising probes for a plurality of the genes for which markers are listed in Tables 1, 2, 3, 4 
or 5. In a specific embodiment, the microarray of the invention comprises 1, 2, 3, 4, 5 or 10 
of the markers in Table 1, at least five of the markers in Table 2; 1, 2, 3, 4, 5 or 10 of the 
markers in Table 3; 1, 2, 3, 4, 5 or 10 of the markers in Table 4; or 1, 2, 3, 4, 5 or 10 of the 
markers in Table 1. In other embodiments, the microarray comprises probes for 1, 2, 3, 4, 5, 
or 10 of the markers shown in any two, three or four of Tables 1-5, or all of Tables 1-5. In 
other embodiments, the microarray of the invention contains each of the markers in Table 1, 
Table 2, Table 3, Table 4, or Table 5. In another embodiment, the microarray contains all of 
the markers shown in Tables 1-5. In specific embodiments, the array comprises probes 
derived only from the markers listed in Table 1, Table 2, Table 3, Table 4, or Table 5; probes 
derived from any two of Tables 1-5; any three of Tables 1-5; any four of Tables 1-5; or all of 
Tables 1-5. 

[00163] In other embodiments, the array comprises a plurality of probes derived from 
markers listed in any of Tables 1-5 in combination with a plurality of other probes, derived 
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from markers not listed in any of Tables 1-5, that are identified as informative for the 
prognosis of breast cancer. 

[00164] In specific embodiments, the invention provides polynucleotide arrays in which the 
breast cancer prognosis markers described herein in Tables 1, 2, 3, 4 and/or 5 comprise at 
least 50%, 60%, 70%, 80%, 85%, 90%, 95% or 98% of the probes on said array. In another 
specific embodiment, the microarray comprises a plurality of probes, wherein said plurality 
of probes comprise probes complementary and hybridizable to at least 75% of the genes for 
which markers are listed in Table 1; probes complementary and hybridizable to at least 75% 
of the genes for which markers are listed in Table 2; probes complementary and hybridizable 
to at least 75% of the genes for which markers are listed in Table 3; probes complementary 
and hybridizable to at least 75% of the genes for which markers are listed in Table 4; and 
probes complementary and hybridizable to at least 75% of the genes for which markers are 
listed in Table 5, wherein said probes, in total, are at least 50% of the probes on said 
microarray. 

[00165] In yet another specific embodiment, microarrays that are used in the methods 
disclosed herein optionally comprise markers additional to at least some of the markers listed 
in Tables 1-5. For example, in a specific embodiment, the microarray is a screening or 
scanning array as described in Altschuler et ah, International Publication WO 02/18646, 
published March 7, 2002 and Scherer et ah, International Publication WO 02/16650, 
published February 28, 2002. The scanning and screening arrays comprise regularly-spaced, 
positionally-addressable probes derived from genomic nucleic acid sequence, both expressed 
and unexpressed. Such arrays may comprise probes corresponding to a subset of, or all of, 
the markers listed in Tables 1-5, or a subset thereof as described above, and can be used to 
monitor marker expression in the same way as a microarray containing only markers listed in 
Tables 1-5. 

[00166] In yet another specific embodiment, the microarray is a commercially-available 
cDNA microarray that comprises at least five of the markers listed in Tables 1-5. Preferably, 
a commercially-available cDNA microarray comprises all of the markers listed in Tables 1-5. 
However, such a microarray may comprise at least 5, 10, 15 or 25 of the markers in any of 
Tables 1-5, up to the maximum number of markers in a Table, and may comprise all of the 
markers in any one of Tables 1-5, and a subset of another of Tables 1-5, or subsets of each as 
described above. In a specific embodiment of the microarrays used in the methods disclosed 
herein, the markers that are all or a portion of Tables 1-5 make up at least 50%, 60%, 70%, 
80%, 90%, 95% or 98% of the probes on the microarray. 
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[00167] General methods pertaining to the construction of microarrays comprising the 

marker sets and/or subsets above are described in the following sections. 

[00168] 

[00169] 

5.5.2.1 CONSTRUCTION OF MICROARRAYS 
[00170] Microarrays are prepared by selecting probes which comprise a polynucleotide 
sequence, and then immobilizing such probes to a solid support or surface. For example, the 
probes may comprise DNA sequences, RNA sequences, or copolymer sequences of DNA and 
RNA. The polynucleotide sequences of the probes may also comprise DNA and/or RNA 
analogues, or combinations thereof. For example, the polynucleotide sequences of the probes 
may be full or partial fragments of genomic DNA. The polynucleotide sequences of the 
probes may also be synthesized nucleotide sequences, such as synthetic oligonucleotide 
sequences. The probe sequences can be synthesized either enzymatically in vivo, 
enzymatically in vitro (e.g., by PGR), or non-enzymatically in vitro. 

[00171] The probe or probes used in the methods of the invention are preferably immobilized 
to a solid support which may be either porous or non-porous. For example, the probes of the 
invention may be polynucleotide sequences which are attached to a nitrocellulose or nylon 
membrane or filter covalently at either the 3' or the 5' end of the polynucleotide. Such 
hybridization probes are well known in the art (see, e.g., Sambrook et aL, Molecular 
Cloning - A Laboratory Manual (2nd Ed.), Vols. 1-3, Cold Spring Harbor Laboratory, 
Cold Spring Harbor, New York (1989). Alternatively, the solid support or surface may be a 
glass or plastic surface. In a particularly preferred embodiment, hybridization levels are 
measured to microarrays of probes consisting of a solid phase on the surface of which are 
immobilized a population of polynucleotides, such as a population of DNA or DNA mimics, 
or, alternatively, a population of RNA or RNA mimics. The solid phase may be a nonporous 
or, optionally, a porous material such as a gel. 

[00172] In preferred embodiments, a microarray comprises a support or surface with an 
ordered array of binding (e.g., hybridization) sites or "probes" each representing one of the 
markers described herein. Preferably the microarrays are addressable arrays, and more 
preferably positionally addressable arrays. More specifically, each probe of the array is 
preferably located at a known, predetermined position on the solid support such that the 
identity (i.e., the sequence) of each probe can be determined from its position in the array 
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(i.e., on the support or surface). In preferred embodiments, each probe is covalently attached 
to the solid support at a single site. 

[00173] Microarrays can be made in a number of ways, of which several are described 
below. However produced, microarrays share certain characteristics. The arrays are 
reproducible, allowing multiple copies of a given array to be produced and easily compared 
with each other. Preferably, microarrays are made from materials that are stable under 
binding (e.g., nucleic acid hybridization) conditions. The microarrays are preferably small, 
e.g., between 1 cm 2 and 25 cm 2 , between 12 cm 2 and 13 cm 2 , or 3 cm 2 . However, larger 
arrays are also contemplated and may be preferable, e.g., for use in screening arrays. 
Preferably, a given binding site or unique set of binding sites in the microarray will 
specifically bind (e.g., hybridize) to the product of a single gene in a cell (e.g., to a specific 
mRNA, or to a specific cDNA derived therefrom). However, in general, other related or 
similar sequences will cross hybridize to a given binding site. 

[00174] The microarrays of the present invention include one or more test probes, each of 
which has a polynucleotide sequence that is complementary to a subsequence of RNA or 
DNA to be detected. Preferably, the position of each probe on the solid surface is known. 
Indeed, the microarrays are preferably positionally addressable arrays. Specifically, each 
probe of the array is preferably located at a known, predetermined position on the solid 
support such that the identity (i.e., the sequence) of each probe can be determined from its 
position on the array (i.e., on the support or surface). 

[00175] According to the invention, the microarray is an array (i.e., a matrix) in which each 
position represents one of the markers described herein. For example, each position can 
contain a DNA or DNA analogue based on genomic DNA to which a particular RNA or 
cDNA transcribed from that genetic marker can specifically hybridize. The DNA or DNA 
analogue can be, e.g., a synthetic oligomer or a gene fragment. In one embodiment, probes 
representing each of the markers is present on the array. In a preferred embodiment, the array 
comprises probes for each of the markers listed in Tables 1-5. 

5.5.2.2 PREPARING PROBES FOR MICROARRAYS 
[00176] As noted above, the "probe" to which a particular polynucleotide molecule 
specifically hybridizes according to the invention contains a complementary genomic 
polynucleotide sequence. The probes of the microarray preferably consist of nucleotide 
sequences of no more than 1,000 nucleotides. In some embodiments, the probes of the array 
consist of nucleotide sequences of 10 to 1,000 nucleotides. In a preferred embodiment, the 
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nucleotide sequences of the probes are in the range of 10-200 nucleotides in length and are 
genomic sequences of a species of organism, such that a plurality of different probes is 
present, with sequences complementary and thus capable of hybridizing to the genome of 
such a species of organism, sequentially tiled across all or a portion of such genome. In other 
specific embodiments, the probes are in the range of 10-30 nucleotides in length, in the range 
of 10-40 nucleotides in length, in the range of 20-50 nucleotides in length, in the range of 40- 
80 nucleotides in length, in the range of 50-150 nucleotides in length, in the range of 80-120 
nucleotides in length, and most preferably are 60 nucleotides in length. 
[00177] The probes may comprise DNA or DNA "mimics" (e.g., derivatives and analogues) 
corresponding to a portion of an organism's genome. In another embodiment, the probes of 
the microarray are complementary RNA or RNA mimics. DNA mimics are polymers 
composed of subunits capable of specific, Watson-Crick-like hybridization with DNA, or of 
specific hybridization with RNA. The nucleic acids can be modified at the base moiety, at 
the sugar moiety, or at the phosphate backbone. Exemplary DNA mimics include, e.g., 
phosphorothioates. 

[001 78] DNA can be obtained, e.g., by polymerase chain reaction (PGR) amplification of 
genomic DNA or cloned sequences. PCR primers are preferably chosen based on a known 
sequence of the genome that will result in amplification of specific fragments of genomic 
DNA. Computer programs that are well known in the art are useful in the design of primers 
with the required specificity and optimal amplification properties, such as Oligo version 5.0 
(National Biosciences). Typically each probe on the microarray will be between 10 bases 
and 50,000 bases, usually between 300 bases and 1,000 bases in length. PCR methods are 
well known in the art, and are described, for example, in Innis et al., eds., PCR Protocols: 
A Guide to Methods and Applications, Academic Press Inc., San Diego, CA (1990). It 
will be apparent to one skilled in the art that controlled robotic systems are useful for 
isolating and amplifying nucleic acids. 

[00179] An alternative, preferred means for generating the polynucleotide probes of the 
microarray is by synthesis of synthetic polynucleotides or oligonucleotides, e.g., using N- 
phosphonate or phosphoramidite chemistries (Froehler et al., Nucleic Acid Res. 14:5399-5407 
(1986); McBride et al, Tetrahedron Lett. 24:246-248 (1983)). Synthetic sequences are 
typically between about 10 and about 500 bases in length, more typically between about 20 
and about 100 bases, and most preferably between about 40 and about 70 bases in length. In 
some embodiments, synthetic nucleic acids include non-natural bases, such as, but by no 
means limited to, inosine. As noted above, nucleic acid analogues may be used as binding 
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sites for hybridization. An example of a suitable nucleic acid analogue is peptide nucleic acid 
(see, e.g., Egholm et al, Nature 363:566-568 (1993); U.S. Patent No. 5,539,083). 
[00180] Probes are preferably selected using an algorithm that takes into account binding 
energies, base composition, sequence complexity, cross-hybridization binding energies, and 
secondary structure. See Friend et al, International Patent Publication WO 01/05935, 
published January 25, 2001; Hughes et al, Nat. Biotech. 19:342-7 (2001). 
[00181] A skilled artisan will also appreciate that positive control probes, e.g., probes known 
to be complementary and hybridizable to sequences in the target polynucleotide molecules, 
and negative control probes, e.g., probes known to not be complementary and hybridizable to 
sequences in the target polynucleotide molecules, should be included on the array. In one 
embodiment, positive controls are synthesized along the perimeter of the array. In another 
embodiment, positive controls are synthesized in diagonal stripes across the array. In still 
another embodiment, the reverse complement for each probe is synthesized next to the 
position of the probe to serve as a negative control. In yet another embodiment, sequences 
from other species of organism are used as negative controls or as "spike-in" controls. 

5.5.2.3 ATTACHING PROBES TO THE SOLID SURFACE 
[00182] The probes are attached to a solid support or surface, which may be made, e.g., from 
glass, plastic (e.g., polypropylene, nylon), polyacrylamide, nitrocellulose, gel, or other porous 
or nonporous material. A preferred method for attaching the nucleic acids to a surface is by 
printing on glass plates, as is described generally by Schena et al, Science 210A67-A1Q 

(1995) . This method is especially useful for preparing microarrays of cDNA (See also, 
DeRisi et al, Nature Genetics 14:457-460 (1996); Shalon et al, Genome Res. 6 :639-645 

(1996) ; and Schena et al, Proc. Natl Acad Set U.S.A. 93:10539-1 1286 (1995)). 
[001 83] A second preferred method for making microarrays is by making high-density 
oligonucleotide arrays. Techniques are known for producing arrays containing thousands of 
oligonucleotides complementary to defined sequences, at defined locations on a surface using 
photolithographic techniques for synthesis in situ (see, Fodor et al, 1991, Science 251 :767~ 
773; Pease et al, 1994, Proc. Natl Acad. Sci. U.S.A. 91:5022-5026; Lockhart et al, 1996, 
Nature Biotechnology 14:1675; U.S. Patent Nos. 5,578,832; 5,556,752; and 5,510,270) or 
other methods for rapid synthesis and deposition of defined oligonucleotides (Blanchard et 
al, Biosensors & Bioelectronics 1 1 :687-690). When these methods are used, 
oligonucleotides (e.g., 60-mers) of known sequence are synthesized directly on a surface such 
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as a derivatized glass slide. Usually, the array produced is redundant, with several 
oligonucleotide molecules per RNA. 

[00184] Other methods for making microarrays, e.g., by masking (Maskos and Southern, 
1992, Nuc. Acids. Res. 20:1679-1684), may also be used. In principle, and as noted supra, 
any type of array, for example, dot blots on a nylon hybridization membrane (see Sambrook 
et al, Molecular Cloning - A Laboratory Manual (2nd Ed.), Vols. 1-3, Cold Spring 
Harbor Laboratory, Cold Spring Harbor, New York (1989)) could be used. However, as will 
be recognized by those skilled in the art, very small arrays will frequently be preferred 
because hybridization volumes will be smaller. 

[00185] In one embodiment, the arrays of the present invention are prepared by synthesizing 
polynucleotide probes on a support. In such an embodiment, polynucleotide probes are 
attached to the support covalently at either the 3' or the 5' end of the polynucleotide. 
[00186] In a particularly preferred embodiment, microarrays of the invention are 
manufactured by means of an ink jet printing device for oligonucleotide synthesis, e.g., using 
the methods and systems described by Blanchard in U.S. Pat. No. 6,028,189; Blanchard et 
al., 1996, Biosensors and Bioelectronics 11:687-690; Blanchard, 1998, in Synthetic DNA 
Arrays in Genetic Engineering, Vol. 20, J.K. Setlow, Ed., Plenum Press, New York at pages 
1 1 1-123. Specifically, the oligonucleotide probes in such microarrays are preferably 
synthesized in arrays, e.g., on a glass slide, by serially depositing individual nucleotide bases 
in "microdroplets" of a high surface tension solvent such as propylene carbonate. The 
microdroplets have small volumes (e.g., 100 pL or less, more preferably 50 pL or less) and 
are separated from each other on the microarray (e.g., by hydrophobic domains) to form 
circular surface tension wells which define the locations of the array elements (i.e., the 
different probes). Microarrays manufactured by this ink-jet method are typically of high 
density, preferably having a density of at least about 2,500 different probes per 1 cm 2 . The 
polynucleotide probes are attached to the support covalently at either the 3' or the 5' end of 
the polynucleotide. 

5.5.2.4 TARGET POLYNUCLEOTIDE MOLECULES 
[00187] The polynucleotide molecules which may be analyzed by the present invention (the 
"target polynucleotide molecules") may be from any clinically relevant source, but are 
expressed RNA or a nucleic acid derived therefrom {e.g., cDNA or amplified RNA derived 
from cDNA that incorporates an RNA polymerase promoter), including naturally occurring 
nucleic acid molecules, as well as synthetic nucleic acid molecules. In one embodiment, the 
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target polynucleotide molecules comprise RNA, including, but by no means limited to, total 
cellular RNA, poly(A) + messenger RNA (mRNA) or fraction thereof, cytoplasmic mRNA, or 
RNA transcribed from cDNA (i.e., cRNA; see, e.g., Linsley & Schelter, U.S. Patent 
Application No. 09/41 1,074, filed October 4, 1999, or U.S. Patent Nos. 5,545,522, 5,891,636, 

or 5,716,785). Methods for preparing total and poly(A) + RNA are well known in the art, and 
are described generally, e.g., in Sambrook et ah, Molecular Cloning - A Laboratory 
Manual (2nd Ed.), Vols. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, New 
York (1989). In one embodiment, RNA is extracted from cells of the various types of interest 
in this invention using guanidinium thiocyanate lysis followed by CsCl centrifugation 
(Chirgwin et al, 1979, Biochemistry 18:5294-5299). In another embodiment, total RNA is 
extracted using a silica gel-based column, commercially available examples of which include 
RNeasy (Qiagen, Valencia, California) and StrataPrep (Stratagene, La Jolla, California). In 
an alternative embodiment, which is preferred for S. cerevisiae, RNA is extracted from cells 
using phenol and chloroform, as described in Ausubel et al, eds., 1989, CURRENT 
Protocols in Molecular Biology, Vol. Ill, Green Publishing Associates, Inc., John Wiley 

& Sons, Inc., New York, at pp. 13.12.1-13.12.5). Poly(A) + RNA can be selected, e.g., by 
selection with oligo-dT cellulose or, alternatively, by oligo-dT primed reverse transcription of 
total cellular RNA. In one embodiment, RNA can be fragmented by methods known in the 
art, e.g., by incubation with ZnCk, to generate fragments of RNA. In another embodiment, 
the polynucleotide molecules analyzed by the invention comprise cDNA, or PGR products of 
amplified RNA or cDNA. 

[00188] In one embodiment, total RNA, mRNA, or nucleic acids derived therefrom, is 
isolated from a sample taken from a person afflicted with breast cancer. Target 
polynucleotide molecules that are poorly expressed in particular cells may be enriched using 
normalization techniques (Bonaldo et al., 1996, Genome Res. 6:791-806). 
[00189] As described above, the target polynucleotides are detectably labeled at one or more 
nucleotides. Any method known in the art may be used to detectably label the target 
polynucleotides. Preferably, this labeling incorporates the label uniformly along the length of 
the RNA, and more preferably, the labeling is carried out at a high degree of efficiency. One 
embodiment for this labeling uses oligo-dT primed reverse transcription to incorporate the 
label; however, conventional methods of this method are biased toward generating 3' end 
fragments. Thus, in a preferred embodiment, random primers (e.g., 9-mers) are used in 
reverse transcription to uniformly incorporate labeled nucleotides over the full length of the 
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target polynucleotides. Alternatively, random primers may be used in conjunction with PCR 
methods or T7 promoter-based in vitro transcription methods in order to amplify the target 
polynucleotides. 

[00190] In a preferred embodiment, the detectable label is a luminescent label. For example, 
fluorescent labels, bioluminescent labels, chemiluminescent labels, and colorimetric labels 
may be used in the present invention. In a highly preferred embodiment, the label is a 
fluorescent label, such as a fluorescein, a phosphor, a rhodamine, or a polymethine dye 
derivative. Examples of commercially available fluorescent labels include, for example, 
fluorescent phosphoramidites such as FluorePrime (Amersham Pharmacia, Piscataway, N.J.), 
Fluoredite (Millipore, Bedford, Mass.), FAM (ABI, Foster City, Calif.), and Cy3 or Cy5 
(Amersham Pharmacia, Piscataway, N.J.). In another embodiment, the detectable label is a 
radiolabeled nucleotide. 

[00191] In a further preferred embodiment, target polynucleotide molecules from a patient 
sample are labeled differentially from target polynucleotide molecules of a standard. The 
standard can comprise target polynucleotide molecules from normal individuals (i.e., those 
not afflicted with breast cancer). In a highly preferred embodiment, the standard comprises 
target polynucleotide molecules pooled from samples from normal individuals or tumor 
samples from individuals having sporadic-type breast tumors. In another embodiment, the 
target polynucleotide molecules are derived from the same individual, but are taken at 
different time points, and thus indicate the efficacy of a treatment by a change in expression 
of the markers, or lack thereof, during and after the course of treatment (i.e., chemotherapy, 
radiation therapy or cryotherapy), wherein a change in the expression of the markers from a 
poor prognosis pattern to a good prognosis pattern indicates that the treatment is efficacious. 
In this embodiment, different timepoints are differentially labeled. 

5.5.2.5 HYBRIDIZATION TO MICROARRAYS 
[00192] Nucleic acid hybridization and wash conditions are chosen so that the target 
polynucleotide molecules specifically bind or specifically hybridize to the complementary 
polynucleotide sequences of the array, preferably to a specific array site, wherein its 
complementary DNA is located. 

[00193] Arrays containing double-stranded probe DNA situated thereon are preferably 
subjected to denaturing conditions to render the DNA single-stranded prior to contacting with 
the target polynucleotide molecules. Arrays containing single-stranded probe DNA (e.g., 
synthetic oligodeoxyribonucleic acids) may need to be denatured prior to contacting with the 
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target polynucleotide molecules, e.g., to remove hairpins or dimers which form due to self 
complementary sequences. 

[00194] Optimal hybridization conditions will depend on the length (e.g., oligomer versus 
polynucleotide greater than 200 bases) and type (e.g., RNA, or DNA) of probe and target 
nucleic acids. One of skill in the art will appreciate that as the oligonucleotides become 
shorter, it may become necessary to adjust their length to achieve a relatively uniform 
melting temperature for satisfactory hybridization results. General parameters for specific 
(i.e., stringent) hybridization conditions for nucleic acids are described in Sambrook et al, 
Molecular Cloning - A Laboratory Manual (2nd Ed.), Vols. 1-3, Cold Spring Harbor 
Laboratory, Cold Spring Harbor, New York (1989), and in Ausubel et al, CURRENT 
Protocols in Molecular Biology, vol. 2, Current Protocols Publishing, New York 
(1994). Typical hybridization conditions for the cDNA microarrays of Schena et al. are 
hybridization in 5 X SSC plus 0.2% SDS at 65°C for four hours, followed by washes at 25°C 
in low stringency wash buffer (1 X SSC plus 0.2% SDS), followed by 10 minutes at 25°C in 
higher stringency wash buffer (0.1 X SSC plus 0.2% SDS) (Schena et al, Proc. Natl. Acad. 
Set U.S.A. 93:10614 (1993)). Useful hybridization conditions are also provided in, e.g., 
Tijessen, 1993, Hybridization With Nucleic Acid Probes, Elsevier Science Publishers 
B.V.; and Kricka, 1992, Nonisotopic DNA Probe Techniques , Academic Press, San 
Diego, CA. 

[00195] Particularly preferred hybridization conditions include hybridization at a temperature 
at or near the mean melting temperature of the probes (e.g. , within 5 1°C, more preferably 
within 21°C) in 1 M NaCl, 50 mM MES buffer (pH 6.5), 0.5% sodium sarcosine and 30% 
forrnamide. 

5.5.2.6 SIGNAL DETECTION AND DATA ANALYSIS 
[00196] When fluorescently labeled probes are used, the fluorescence emissions at each site 
of a microarray may be, preferably, detected by scanning confocal laser microscopy. In one 
embodiment, a separate scan, using the appropriate excitation line, is carried out for each of 
the two fluorophores used. Alternatively, a laser may be used that allows simultaneous 
specimen illumination at wavelengths specific to the two fluorophores and emissions from 
the two fluorophores can be analyzed simultaneously (see Shalon et al, 1996, "A DNA 
microarray system for analyzing complex DNA samples using two-color fluorescent probe 
hybridization," Genome Research 6:639-645, which is incorporated by reference in its 
entirety for all purposes). In a preferred embodiment, the arrays are scanned with a laser 
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fluorescent scanner with a computer controlled X-Y stage and a microscope objective. 
Sequential excitation of the two fluorophores is achieved with a multi-line, mixed gas laser 
and the emitted light is split by wavelength and detected with two photomultiplier tubes. 
Fluorescence laser scanning devices are described in Schena et ah, Genome Res. 6:639-645 
(1996), and in other references cited herein. Alternatively, the fiber-optic bundle described 
by Ferguson et ah , Nature Biotech 14: 1681-1684 (1996), may be used to monitor mRNA 
abundance levels at a large number of sites simultaneously. 

[00197] Signals are recorded and, in a preferred embodiment, analyzed by computer, e.g., 
using a 12 or 16 bit analog to digital board. In one embodiment the scanned image is 
despeckled using a graphics program {e.g., Hijaak Graphics Suite) and then analyzed using an 
image gridding program that creates a spreadsheet of the average hybridization at each 
wavelength at each site. If necessary, an experimentally determined correction for "cross 
talk" (or overlap) between the channels for the two fluors may be made. For any particular 
hybridization site on the transcript array, a ratio of the emission of the two fluorophores can 
be calculated. The ratio is independent of the absolute expression level of the cognate gene, 
but is useful for genes whose expression is significantly modulated in association with the 
different breast cancer-related condition. 

5.6 THERAPEUTIC REGIMENS SPECIFIC TO PATIENT SUBSETS 
[00198] The benefit of identifying subsets of individuals that have a common condition, 
followed by identification of sets of genes informative for those particular subsets of 
individuals, is that such subdivision and identification tends to more accurately identify the 
subset of genes responsible for, or most closely associated with, a particular form of the 
condition. For example, breast cancer is a complex condition brought about by several 
different molecular mechanisms. ER+ individuals, particularly ER4-, ER/AGE high 
individuals, show an increased level of expression of cell cycle-control genes, and the 
expression of these genes is highly informative for prognosis in this patient subset {see 
Examples). In ER~ individuals, however, the expression of these genes is not informative for 
prognosis. 

[00199] The set of informative markers, therefore, can be used to assign a particular course 
of therapy to an individual, e.g., an individual having breast cancer, depending upon the 
condition subset into which the individual is classified. In one embodiment, therefore, the 
invention provides a method of assigning a course of therapy to an individual having a 
condition, said method comprising classifying the individual into one of a plurality of subsets 



-81 - 



WO 2005/086891 



PCT/US2005/007894 



of a condition, wherein a plurality of informative genes has been identified for at least one of 
said subsets; and assigning a course of therapy known or suspected to be effective for treating 
the subset of the condition associated with those genes. In a specific embodiment, said 
condition is breast cancer, said patient subset is ER+, ER/AGE high status, and said course of 
therapy comprises the administration of one or more compounds known or suspected to be 
effective at arresting the cell cycle. In a more specific embodiment, said one or more 
compounds comprises taxol or a vinca alkaloid. 

[00200] Of course, any course of therapy selected or assigned on the basis of the above 
phenotypes and gene expression may be supplemented by other treatments or courses of 
therapy relevant to or known or suspected to be effective in the treatment of the condition. 
For example, the treatment of breast cancer may additionally comprise surgery, either tissue- 
preserving or radical, radiation treatment, chemotherapy other than that suggested by gene 
expression analysis, or any other therapy or treatment known or suspected to be effective. 

5.7 CLINICAL TRIALS AND EPIDEMIOLOGICAL STUDIES 
[00201] The method of the present invention may also be used to assign individuals to 
categories within a clinical trial, epidemiological study or the like. For example, individuals 
may be distinguished according to a characteristic of a condition, such as the presence or 
absence of specific proteins (e.g., estrogen receptor) or tissue structures (e.g., lymph nodes), 
and with prognosis, and the results of the trial correlated with prognosis. In a specific 
example, the condition is breast cancer, the characteristic is the presence of the estrogen 
receptor, and the outcome is prognosis is the expected reoccurrence or non-reoccurrence of 
metastases within a given period, for example, five years, after initial diagnosis. In another 
specific example, the condition is obesity, the characteristics are 24-hour energy expenditure, 
and the prognosis is the expected occurrence of heart disease or diabetes. In another specific 
example, the condition is a neurodegenerative disease, the characteristic is exposure to a 
particular range of concentration of an environmental toxin, and the prognosis is expected 
occurrence or degree of loss of motor function. In each case, the characteristics and expected 
outcome are used to assign the individual to a category within a clinical trial or 
epidemiological study. 

[00202] Thus, the invention provides a method for assigning an individual to one of a 
plurality of categories in a clinical trial, comprising classifying the individual into one of a 
plurality of condition categories differentiated by at least one genotypic or phenotypic 
characteristic of the condition; determining the level of expression, in a sample derived from 
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said individual, of a plurality of genes informative for said condition category; determining 
whether said level of expression of said plurality of genes indicates that the individual has a 
good prognosis or a poor prognosis; and assigning the individual to a category in a clinical 
trial on the basis of prognosis. 

[00203] In a specific embodiment, the invention provides a method of assigning an 
individual to a category in a breast cancer clinical trial, said method comprising: (a) 
classifying said individual as ER~ BRCA1, ER~, sporadic; ER+, ER/AGE high; ER+, 
ER/AGE low, LN+; or ER+, ER/AGE low, LIST; (b) determining for said individual the level 
of expression of at least two genes for which markers are listed in Table 1 if said individual 
is classified as ER~, BRCA1; Table 2 if said individual is classified as ER", sporadic; Table 3 
if said individual is classified as ER+ } ER/AGE high; Table 4 if said individual is classified 
as ER+, ER/AGE low, LN+; or Table 5 if said individual is classified as ER+, ER/AGE low, 
LIST; (c) determining whether said individual has a pattern of expression of said at least two 
genes that correlates with a good prognosis or a poor prognosis; and (d) assigning said 
individual to at least one category in a clinical trial if said individual has a good prognosis, 
and assigning said individual to a second category in said clinical trial if said individual has a 
poor prognosis. In a more specific embodiment, said individual is additionally assigned to a 
category in said clinical trial on the basis of the classification of said individual as determined 
in step (a). In another more specific embodiment, said individual is additionally assigned to a 
category in said clinical trial on the basis of any other clinical, phenotypic or genotypic 
characteristic of breast cancer. In another more specific embodiment, the method 
additionally comprises determining in said cell sample the level of expression, relative to a 
control, of a second plurality of genes for which markers are not found in Tables 1-5, wherein 
said second plurality of genes is informative for prognosis of breast cancer, and determining 
from the expression of said second plurality of genes, in addition to said first plurality of 
genes, whether said individual has a good prognosis or a poor prognosis. 

5.8 KITS 

[00204] The present invention further provides for kits comprising the marker sets described 
above. The components of the kits of the present invention are preferably contained in sealed 
containers. In a preferred embodiment, the kit comprises a microarray ready for 
hybridization to target polynucleotide molecules. In specific embodiments, the kit may 
comprise any of the microarrays described in detail in Section 5.5.2. Where proteins are the 
target molecules, the kit preferably comprises a plurality of antibodies for binding to specific 
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condition-related proteins, and means for identifying such binding (e.g., means for 
performing a sandwich assay, ELISA, RIA, etc.). Such antibodies may be provided, for 
example, individually or as part of an antibody array. The kit may additionally comprise 
software for the data analyses described above, as described in detail in Section 5.9. The kit 
preferably contains one or more control samples. Such a control sample may be an artificial 
population of marker-related or marker-derived polynucleotides suitable for hybridization to 
a microarray, wherein the markers are related to or relevant to the condition of interest (for 
example, breast cancer). The control may also, or alternatively, be a set of expression values 
stored on a computer disk or other storage medium. 

[00205] The kits of the invention may be primarily diagnostic in nature; that is, they may 
assist a physician or researcher in determining a characteristic, for example, the prognosis, of 
a condition of interest, the likely response to a therapeutic regimen, the likely outcome of 
exposure to an environmental condition, such as toxin exposure, etc. The kits of the 
invention may also be used to classify individuals, for example, to place individuals into 
different groups in a clinical trial. The use of each kit is determined by the markers, 
microarrays, controls, etc. included. 

[00206] COMPUTER-FACILITATED ANALYSISThe analytic methods described in the 
previous sections can be implemented by use of the following computer systems and 
according to the following programs and methods. A computer system comprises internal 
components linked to external components. The internal components of a typical computer 
system include a processor element interconnected with a main memory. For example, the 
computer system can be based on an Intel 8086-, 803 86-, 80486-, Pentium™, or Pentium™- 
based processor with preferably 32 MB or more of main memory. The computer system may 
also be a Macintosh or a Macintosh-based system, but may also be a minicomputer or 
mainframe. 

[00207] The external components preferably include mass storage. This mass storage can be 
one or more hard disks (which are typically packaged together with the processor and 
memory). Such hard disks are preferably of 1 GB or greater storage capacity. Other external 
components include a user interface device, which can be a monitor, together with an 
inputting device, which can be a "mouse", or other graphic input devices, and/or a keyboard. 
A printing device can also be attached to the computer. 

[00208] Typically, a computer system is also linked to network link, which can be part of an 
Ethernet link to other local computer systems, remote computer systems, or wide area 
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communication networks, such as the Internet. This network link allows the computer 
system to share data and processing tasks with other computer systems. 
[00209] Loaded into memory during operation of this system are several software 
components, which are both standard in the art and special to the instant invention. These 
software components collectively cause the computer system to function according to the 
methods of this invention. These software components are typically stored on the mass 
storage device. A software component comprises the operating system, which is responsible 
for managing computer system and its network interconnections. This operating system can 
be, for example, of the Microsoft Windows® family, such as Windows 3.1, Windows 95, 
Windows 98, Windows 2000, or Windows NT, or may be of the Macintosh OS family, or 
may be UNIX, a UNIX derivative such as LINUX, or an operating system specific to a 
minicomputer or mainframe. The software component represents common languages and 
functions conveniently present on this system to assist programs implementing the methods 
specific to this invention. Many high or low level computer languages can be used to 
program the analytic methods of this invention. Instructions can be interpreted during run- 
time or compiled. Preferred languages include C/C++, FORTRAN and JAVA. Most 
preferably, the methods of this invention are programmed in mathematical software packages 
that allow symbolic entry of equations and high-level specification of processing, including 
some or all of the algorithms to be used, thereby freeing a user of the need to procedurally 
program individual equations or algorithms. Such packages include Mathlab from 
Mathworks (Natick, MA), Mathematica® from Wolfram Research (Champaign, IL), or S- 
Plus® from Math Soft (Cambridge, MA). Specifically, the software component includes the 
analytic methods of the invention as programmed in a procedural language or symbolic 
package. 

[00210] The software to be included with the kit comprises the data analysis methods of the 
invention as disclosed herein. In particular, the software may include mathematical routines 
for marker discovery, including the calculation of similarity values between clinical 
categories (e,g. 9 prognosis) and marker expression. The software may also include 
mathematical routines for calculating the similarity between sample marker expression and 
control marker expression, using array-generated fluorescence data, to determine the clinical 
classification of a sample. 

[0021 1] Additionally, the software may also include mathematical routines for determining 
the prognostic outcome, and recommended therapeutic regimen, for an individual with a 
condition of interest. In the specific example of breast cancer, the mathematical routines 
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would determine the prognostic outcome and recommended therapeutic regimen for an 
individual having breast cancer. Such breast cancer-specific software would include 
instructions for the computer system's processor to receive data structures that include the 
level of expression of five or more of the marker genes listed in any of Tables 1-5 in a breast 
cancer tumor sample obtained from the breast cancer patient; the mean level of expression of 
the same genes in a control or template; and the breast cancer patient's clinical information, 
including age ? lymph node status and ER status. The software may additionally include 
mathematical routines for transforming the hybridization data and for calculating the 
similarity between the expression levels for the marker genes in the patient's breast cancer 
tumor sample and a control or template. In a specific embodiment, the software includes 
mathematical routines for calculating a similarity metric, such as a coefficient of correlation, 
representing the similarity between the expression levels for the marker genes in the patient's 
breast cancer tumor sample and the control or template, and expressing the similarity as that 
similarity metric. 

[00212] The software preferably would include decisional routines that integrate the patient's 
clinical and marker gene expression data, and recommend a course of therapy. In one 
embodiment, for example, the software causes the processor unit to receive expression data 
for prognosis-related genes in the patient's tumor sample, calculate a metric of similarity of 
these expression values to the values for the same genes in a template or control, compare 
this similarity metric to a pre-selected similarity metric threshold or thresholds that 
differentiate prognostic groups, assign the patient to the prognostic group, and, on the basis of 
the prognostic group, assign a recommended therapeutic regimen. In a specific example, the 
software additionally causes the processor unit to receive data structures comprising clinical 
information about the breast cancer patient. In a more specific example, such clinical 
information includes the patient's age, estrogen receptor status, and lymph node status. 
[00213] The software preferably causes the processor unit to receive data structures 
comprising relevant phenotypic and/or genotypic characteristics of the particular condition of 
interest, and/or of an individual having that condition, and classifies the individual into a 
condition subset according to those characteristics. The software then causes the processor to 
receive values for subset-specific markers, to calculate a metric of similarity of the values 
associated with those markers (e.g., level, abundance, activity, etc.) from the individual to a 
control, compare this similarity metric to a pre-selected similarity metric threshold or 
thresholds that differentiate prognostic groups, assign the patient to a prognostic group, and, 
on the basis of the prognostic group, assign a recommended therapeutic regimen. In the 
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specific example of breast cancer and a breast cancer patient, the software, in one 
embodiment, causes the processor unit to receive data structures comprising the patient's age, 
estrogen receptor status, and lymph node status, and on the basis of this data, to classify the 
patient into one of the following patient subsets: ER~~, sporadic; ER~~, BRCA1; ER+, 
AR/AGE high; ER+, ER/AGE low, LN+; or ER+, ER/AGE low, LN~ The software then 
causes the processor to receive expression values for subset-specific prognosis-informative 
gene expression in the patient's tumor sample, calculate a metric of similarity of these 
expression values to the values for the same genes in a patient subset-specific template or 
control, compare this similarity metric to a pre-selected similarity metric threshold or 
thresholds that differentiate prognostic groups, assign the patient to the prognostic group, 
and, on the basis of the prognostic group, assign a recommended therapeutic regimen. 
[00214] Where the control is an expression template comprising expression values for 
marker genes within a group of patients, e.g., breast cancer patients, the control can comprise 
either hybridization data obtained at the same time (i.e., in the same hybridization 
experiment) as the patient's individual hybridization data, or can be a set of hybridization or 
marker expression values stores on a computer, or on computer-readable media. If the latter 
is used, new patient hybridization data for the selected marker genes, obtained from initial or 
follow-up tumor samples, or suspected tumor samples, can be compared to the stored values 
for the same genes without the need for additional control hybridizations. However, the 
software may additionally comprise routines for updating the control data set, e.g., to add 
information from additional breast cancer patients or to remove existing members of the 
control data set, and, consequently, for recalculating the average expression level values that 
comprise the template. In another specific embodiment, said control comprises a set of 
single-channel mean hybridization intensity values for each of said at least five of said genes, 
stored on a computer-readable medium. 

[00215] Clinical data relating to a breast cancer patient, or a patient having another type of 
condition, and used by the computer program products of the invention, can be contained in a 
database of clinical data in which information on each patient is maintained in a separate 
record, which record may contain any information relevant to the patient, the patient's 
medical history, treatment, prognosis, or participation in a clinical trial or study, including 
expression profile data generated as part of an initial diagnosis or for tracking the progress of 
the condition, for example, breast cancer, during treatment. 

[00216] Thus, one embodiment of the invention provides a computer program product for 
classifying a breast cancer patient according to prognosis, the computer program product for 
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use in conjunction with a computer having a memory and a processor, the computer program 
product comprising a computer readable storage medium having a computer program 
mechanism encoded thereon, wherein said computer program product can be loaded into the 
one or more memory units of a computer and causes the one or more processor units of the 
computer to execute the steps of (a) receiving a first data structure comprising said breast 
cancer patient's age, ER status, LN status and tumor type; (b) classifying said patient as ER~~, 
sporadic; ER~, BRCA1; ER+, ER/AGE high; ER+, ER/AGE low, LN+; or ER+, ER/AGE 
low, LN~; (c) receiving a first data structure comprising the level of expression of at least two 
genes in a cell sample taken from said breast cancer patient wherein markers for said at least 
two genes are listed in Table 1 if said patient is classified as ER~ sporadic; Table 2 if said 
patient is classified as ER~~, sporadic; Table 3 if said patient is classified as ER+, ER/AGE 
high; Table 4 if said patient is classified as ER+, ER/AGE low, LN+; or Table 5 if said 
patient is classified as ER+, ER/AGE high, LIST; (d) determining the similarity of the level of 
expression of said at least two genes to control levels of expression of said at least two genes 
to obtain a patient similarity value; (e) comparing said patient similarity value to selected first 
and second threshold values of similarity of said level of expression of said genes to said 
control levels of expression to obtain first and second similarity threshold values, 
respectively, wherein said second similarity threshold indicates greater similarity to said 
control levels of expression than does said first similarity threshold; and (f) classifying said 
breast cancer patient as having a first prognosis if said patient similarity value exceeds said 
first and said second threshold similarity values, a second prognosis if said patient similarity 
value exceeds said first threshold similarity value but does not exceed said second threshold 
similarity value, and a third prognosis if said patient similarity value does not exceed said 
first threshold similarity value or said second threshold similarity value. In a specific 
embodiment of said computer program product, said first threshold value of similarity and 
said second threshold value of similarity are values stored in said computer. In another more 
specific embodiment, said first prognosis is a "very good prognosis," said second prognosis is 
an "intermediate prognosis," and said third prognosis is a "poor prognosis," and wherein said 
computer program mechanism may be loaded into the memory and further cause said one or 
more processor units of said computer to execute the step of assigning said breast cancer 
patient a therapeutic regimen comprising no adjuvant chemotherapy if the patient is lymph 
node negative and is classified as having a good prognosis or an intermediate prognosis, or 
comprising chemotherapy if said patient has any other combination of lymph node status and 
expression profile. In another specific embodiment, said computer program mechanism may 
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be loaded into the memory and further cause said one or more processor units of the 
computer to execute the steps of receiving a data structure comprising clinical data specific to 
said breast cancer patient. In a more specific embodiment, said single-channel hybridization 
intensity values are log transformed. The computer implementation of the method, however, 
may use any desired transformation method. In another specific embodiment, the computer 
program product causes said processing unit to perform said comparing step (e) by 
calculating the difference between the level of expression of each of said genes in said cell 
sample taken from said breast cancer patient and the level of expression of the same genes in 
said control. In another specific embodiment, the computer program product causes said 
processing unit to perform said comparing step (e) by calculating the mean log level of 
expression of each of said genes in said control to obtain a control mean log expression level 
for each gene, calculating the log expression level for each of said genes in a breast cancer 
sample from said breast cancer patient to obtain a patient log expression level, and 
calculating the difference between the patient log expression level and the control mean log 
expression for each of said genes. In another specific embodiment, the computer program 
product causes said processing unit to perform said comparing step (e) by calculating 
similarity between the level of expression of each of said genes in said cell sample taken from 
said breast cancer patient and the level of expression of the same genes in said control, 
wherein said similarity is expressed as a similarity value. In more specific embodiment, said 
similarity value is a correlation coefficient. The similarity value may, however, be expressed 
as any art-known similarity metric. 

[00217] Of course, the above breast cancer-specific examples are not limiting; analogous 
computer systems, software, and data analysis methods may be utilized for any condition of 
interest. For example, analogous software may be used to determine the prognosis of any 
other type of cancer, or of any other non-cancer diseases or conditions, using markers, 
expression level data and controls specific for that cancer, non-cancer disease or condition. 
[00218] In an exemplary implementation, to practice the methods of the present invention, a 
user first loads experimental data into the computer system. These data can be directly 
entered by the user from a monitor, keyboard, or from other computer systems linked by a 
network connection, or on removable storage media such as a CD-ROM, floppy disk (not 
illustrated), tape drive (not illustrated), ZIP® drive (not illustrated) or through the network. 
Next the user causes execution of expression profile analysis software which performs the 
methods of the present invention. 
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[00219] In another exemplary implementation, a user first loads experimental data and/or 
databases into the computer system. This data is loaded into the memory from the storage 
media or from a remote computer, preferably from a dynamic geneset database system, 
through the network. Next the user causes execution of software that performs the steps of 
the present invention. 

[00220] Additionally, because the data obtained and analyzed in the software and computer 
system products of the invention may be confidential, the software and/or computer system 
preferably comprises access controls or access control routines, such as password protection 
and preferably, particularly if information is to be transmitted between computers, for 
example, over the Internet, encryption of the data by a suitable encryption algorithm (e.g., 
PGP). 

[00221] Alternative computer systems and software for implementing the analytic methods 
of this invention will be apparent to one of skill in the art and are intended to be 
comprehended within the accompanying claims. In particular, the accompanying claims are 
intended to include the alternative program structures for implementing the methods of this 
invention that will be readily apparent to one of skill in the art. 

6. EXAMPLE: IDENTIFICATION OF PHENOTYPIC SUBSETS AND 

INFORMATIVE GENESETS FOR EACH 
[00222] Materials and Methods 
Tumor Samples : 

[00223] 311 cohort samples were collected from breast cancer patients. Selection criteria for 
sporadic patients (i.e., those not identified as having a BRCAl-type tumor; n = 291) included: 
primary invasive breast carcinoma less than 5 cm (Tl or T2); no axillary metastases (NO); 
age at diagnosis of less than 55 years; calendar year of diagnosis 1983-1996; and no previous 
malignancies. All patients were treated by modified radical mastectomy or breast-conserving 
treatment. See van't Veer et ah, Nature 415:530 (2002). Selection criteria for hereditary 
(i.e., BRCAl-type; n = 20) tumors included: carriers of germline mutation in BRCA1 or 
BRCA2, and primary invasive breast carcinoma, van't Veer, supra. Additionally, for 
development of a classifier for the BRCA1 group, 14 BRCA1 samples previously identified 
(see van't Veer, supra) were added to the 20 BRCA1 type samples to increase sample size. 
Those 14 samples also satisfy the conditions that they are ER negative and age less than 55 
years old. 

[00224] Data analysis: 
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[00225] Sample sub-grouping: As shown in FIG. 1, tumor samples were first divided into 
ER+ and ER~ branches since this is the dominant gene expression pattern. In the ER~~ 
branch, the samples were further divided into "BRCA1 mutation like" and "Sporadic like" 
categories using the expression templates and 100 genes previously identified as optimal for 
determining BRCA1 status. See van't Veer et ah, Nature 415:530 (2002). In the ER+ 
category, samples were divided by ER vs. age distribution (see below) into two groups, 
"ER/AGE low" and "ER/AGE high." Within the "ER/AGE low" group, samples were 
further divided according to the lymph node status into two sub-groups: lymph node negative 
(0 lymph nodes; LN-) and positive (> 0 lymph nodes; LN+) group. 

[00226] The result of these divisions was five distinctive sub-groups: "ER~~, sporadic" (n = 
52), "ER", BRCA1" (n = 34), "ER+, ER/AGE high" (n = 83), "ER+, ER/AGE low, LN~" (n 
- 81), and "ER+, ER/AGE low, LN+" (n = 75). A few samples with a specific ER vs. age 
distribution in "ER+, ER/AGE low, LN+" group were further excluded to develop a 
classifier, see below for details. 

[00227] Estrogen receptor level : Estrogen receptor gene expression level was measured by a 
60mer oligo-nucleotide on a microarray. Since every individual sample was compared to a 
pool of all samples, the ratio to pool was used to measure the relative level. A threshold of 
-0.65 on logio(ratio) was used to separate the ER+ group from ER - group. See van't Veer et 
a/., Nature 415:530 (2002). 

[00228] Grouping by ER vs. age distribution : Samples were not umformly distributed in ER 
vs. age space among the ER+ samples (FIG. 2). First, it appeared that the ER level increases 
with age, as there were few samples from young individuals having a high ER expression 
level. For example, in the 35 to 40 years age group, samples having a log(ratio) of ER > 0.2 
are relatively few as compared to the 40 to 45 age group. In the set of samples used, the 40 < 
age < 45 group contains 30 samples having log(ratio) ER values between —0.2 to 0.2, and 28 
samples having values greater than 0.2, whereas the 35 < age < 40 group includes 24 samples 
with values between -0.2 to 0.2, but only 6 samples with values of greater than 0.2 (Fisher's 
exact test P-value: 1%). The increasing ER level with age may simply due to the fact that 
estrogen levels decrease with age, and the estrogen receptor level rises in compensation. 
[00229] There also appeared to be at least two groups of patients, as indicated by the solid 
line separating the two in FIG. 2A. A bimodality test of the separation indicated by the solid 
line yielded P-value < 10 -4 . Each of these two groups has its own trend between the ER level 
and age. The solid line can be approximated by ER = 0.1 (age - 42.5). Patients having values 
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above the solid line are referred to as the "ER/AGE high" group, and the patients below the 
line as the "ER/AGE low" group. 
[00230] Prognosis in each group: 

[00231] Feature selection and performance evaluation : For the prognosis in each group, non- 
informative genes were filtered in each group of patients. Specifically, only genes with 
|logio(ratio)| > logi 0 (2) and P-value (for log(ratio) ^ 0) < 0.01 in more than 3 experiments 
were kept. This step removed all genes that never had any significant change across all 
samples. The second step used a leave-one-out cross validation (LOOCV) procedure to 
optimize the number of reporter genes (features) in the classifier and to estimate the 
performance of the classifier in each group. The feature selection was included inside the 
loop of each LOOCV process. The final "optimal" reporter genes were selected using all of 
the "training samples" as the result of "re-substitution" because one classifier was needed for 
each group. 

[00232] Selection of training samples : Only the samples from patients who had metastases 
within 5 years of initial diagnosis (3 years for "ER~ sporadic" samples; i.e., the "poor 
outcome" group), or who were metastases-free with more than 5 years of follow-up time (i.e., 
the "good outcome" group, were used as the training set. Because the average expression 
levels for informative genes among patients who were metastasis-free, or who had early 
metastases, were used as expression templates for prediction, the training samples for the 
ER+ samples were further limited to those samples that could also be correctly classified by 
the first round of LOOCV process. For the "ER~~, sporadic" samples, no such iteration was 
done because no improvement was observed. For the "ER~, BRCA1" samples, an iteration 
was done, but the training samples in the second iteration were limited to the correctly 
predicted good outcome samples from the first round of LOOCV, and all the poor outcome 
samples with metastases time less than 5 years. Further limitation of the poor outcome 
samples was not performed because of the small number of poor samples and the absence of 
improvement by such limitation. In the first round of LOOCV, except for the "ER~ 
sporadic" group, the number of features was fixed at 50 genes. A patient was predicted to 
have a favorable outcome, that is, no metastases within five years of initial diagnosis, if the 
expression of the reporter genes in a sample from the individual was more similar to the 
"average good profile" than the "average poor profile", and a poor outcome, that is, a 
metastasis within five years, if the expression of the reporter genes in the sample was more 
similar to the "average poor profile" than the "average good profile". 



-92- 



WO 2005/086891 



PCT/US2005/007894 



[00233] The justification for such an iteration operation is threefold. First, biologically, there 
are always a few individuals with specific reasons (different from the vast majority) to stay 
metastases free or to develop metastases. Second, statistically, most groups of patients 
include outliers that don't follow the distribution of the majority of samples. Third, 
methodologically, the iteration operation is very similar to the idea of "boosting", but instead 
of increasing the weights of the samples predicted wrong, emphasis is placed on the well 
behaved samples for selecting features and training the classifier. Since this process was 
used to select "training samples", and the performance was evaluated using the LOOCV 
(including the feature selection) after the training sample being fixed, there is no issue of 
over-fitting involved in our procedures. This method of iteration is thus more likely to reveal 
the dominant mode to metastases within each group. 

[00234] Error rate and odds ratio, threshold in the final LOOCV : Unless otherwise stated, 
the error rate was the average error rate from two populations: (1) the number of poor 
outcome samples misclassified as good outcome samples, divided by the total number of poor 
outcome samples; and (2) the total number of good outcome samples misclassified as poor 
outcome samples, divided by the total number of good samples. Two odds ratios were 
reported for a given threshold: (1) the overall odds ratio and (2) the 5 year odds ratio. The 5 
year odds ratio was calculated from samples from individuals that were metastases free for 
more than five years, and who experienced metastasis within 5 years. The threshold was 
applied to corl — cor2, where "corl" stands for the correlation to the "average good profile" 
in the training set, and "cor2" stands for the correlation to the "average poor profile" in the 
training set. 

[00235] The threshold in the final round of LOOCV was defined using the following steps: 
(1) For each of the N sample i left out for training, features based on the training set were 
selected, (2) given a feature set, an incomplete LOOCV with N-l samples was performed 
(only the "average poor profile" and "average good profile" is varied depending on whether 
the left out sample is in the training set or not), (3) the threshold based on the minimum error 
rate from N-l samples was determined, and that threshold was assigned to sample i in step 
(1), (4) the median threshold from all N samples was taken, and designated the final 
threshold. FIGS. 3-7 present detailed information about classifiers for the 5 groups: "ER~ 
sporadic", "ERT, BRCA1", "ER+, ER/age high", "ER+, ER/age low, LN~", "ER+, ER/age 
low, LN+". Tables 1-5 {see Section 5.3) list the final optimal reporter genes for each of the 5 
classifiers for each of the five patient subsets. Table 6, below, summarizes the performance 
of each of the five classifiers together with thresholds used in each classifier. 
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[00236] Table 6. Performance of classifiers for each patient subset. 



Classifier 


Optimal # 
of Genes 


(C1-C2) 
Threshold 


Metastasis 
Free 


#of 
Samples 


TP 


FP 


FN 


TN 


Odds 
Ratio 


95% C.I. 


ER+, ER/AGE 
high 


50 


1.22 


Overall 


83 


31 


14 


5 


33 


14.61 


4.71- 
45.36 








5 year 


71 


24 


11 


3 


33 


24.00 


6.03- 
95.46 


ER+, ER/AGE 
low, LN- 


65 


0.38 


Overall 


81 


14 


6 


6 


55 


21.39 


5.98- 
76.52 








5 year 


73 


11 


4 


5 


53 


29.15 


6.73- 
126.33 


ER+, ER/AGE 
low, LN+ 


50 


-0.12 


Overall 


56 


7 


4 


6 


39 


11.38 


2.54- 
50.94 








5 year 


48 


5 


4 


3 


36 


15.00 


2.57- 
87.64 


cr\-j sporadic 




-U.U I 


fit. / /"*\ 1 1 

uveraii 




18 


-7 

7 


7 


29 


7.35 


2.16- 
25.04 








5 year 


45 


16 


5 


6 


18 


9.60 


2.45- 
37.58 


ER-, BRCA1 


10 


-0.37 


Overall 


34 


6 


3 


3 


22 


14.67 


2.34- 
92.11 








5 year 


22 


6 


1 


3 


12 


24.00 


2.04- 
282.68 



[00237] TP: True positive 



[00238] FP: False positive 
[00239] FN: False negative 
[00240] TN: True negative 

[00241] Classification method: All classifiers described herein, feature selection and 
optimization were included inside the LOOCV loop. Classifier performance was based on 
the LOOCV results. The profile based on the selected features from each patient was 
compared to the "average good profile" and "average poor profile" (by correlation) to 
determine its predicted outcome. 

[00242] Correlation calculation: The correlation between each gene's expression log(ratio) 
and the endpoint data (final outcome) was calculated using the Pearson's correlation 
coefficient. The correlation between each patient's profile and the "average good profile" 
and "average poor profile" was the cosine product (no mean subtraction). 
[00243] Results: 

[00244] The comprehensive prognosis strategy was employed on microarray expression 
profiles of 31 1 patients diagnosed before age 55 that were all part of previous studies 
establishing and validating a 70-gene prognosis profile. See van 't Veer et al, Nature 
415:530 (2002); van de Vijver et al, K Engl J. Med. 347:1999 (2002). In addition, 14 
known BRCA1 samples from the Nature study were included in defining the prognosis 
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classifier for the BRCA1 group. The overview of the stratifications is shown in FIG. 1. In 
each of the patient subsets, prognosis classifiers were developed and performance was 
evaluated by leave-one-out cross-validation. The biological make up of each of the 
classifiers was also examined. 

[00245] During the process to decide whether a particular clinical parameter should be used 
for the next stratification, our objectives were twofold: (1) identification of homogeneous 
prognosis patterns; and/or (2) improved prognosis in the subsets. There is a subtle balance 
between these two objectives because smaller groups will likely lead to uniform patterns 
within the group but have increasingly limited predictive power. With the exception of the 
BRCA1 subset, each group in our stratification contained 50 or more samples. 
[00246] The first layer of stratification was based on the estrogen receptor level. It was 
previously observed that estrogen receptor expression has a dominant effect on overall gene 
expression in breast cancer as seen in hierarchical clustering, van 't Veer et ah, Nature 
415:530 (2002); Perou et al, Nature 406:141 (2000); Gruvberger et ah, Cancer Res. 61:5979 
(2001). In previous analysis up to 2500 genes were significantly correlated with ER 
expression levels in tumor. See, van 't Veer et al 9 Nature 415:530 (2002). According to the 
threshold defined previously (van de Vijver et al, N. Engl J. Med. 347:1999 (2002)), 
samples were first divided into two groups according to the estrogen receptor level as 
measured by the oligo probe (accession number: NM_000125) on the array; samples with 
log(ratio) > -0.65 belong to the ER+ group, and the rest belong to ER~~ group). This resulted 
in 239 samples in the ER+ group and 72 samples in the ER~ group. 

[00247] In the ER+ branch it was observed that when displaying ER expression level as a 
function of age, at least two subgroups appeared to exist. (In general, any bimodality in the 
clinical data is useful.) The tumors were stratified according this bimodality {see FIG. 2). 
The group of ER+ patients having a high ER/AGE ratio was designated the "ER/AGE high" 
group (83 samples), and the remaining group of patients was designated "ER/AGE low" 
group (156 samples). 

[00248] Within the "ER/age high" group, a group of prognosis reporter genes that highly 
correlated with the outcome is identified {see Table 3). Moreover, the expression of these 
genes appeared to be very homogeneous, as indicated by high similarity in expression among 
those genes. See FIG 2 A. Leave-one-out cross validation including reporter selection 
yielded an odds ratio of 14.6 (95%CI: 4.7-45.4) and 5 year odds ratio of 24.0 (95%CI: 6.0- 
95.5). Examination of those reporter genes reveals they are mostly the cell cycle genes which 
are highly expressed in the poor outcome tumors. It is worth noting that even though this 
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group includes LN+ and LN— individuals, and mixed treatment, the incidence of distant 
metastases is predicted by a biologically uniform set of genes, possibly indicating that 
proliferation is the prime driving force for disease progression. Also even though variation in 
these genes is observed in other tumor subgroups this is generally not correlated with 
outcome in those settings (see below). 

[00249] In the "ER/age low" group, no predictive pattern was found in the whole group; thus, 
the samples were further stratified into LN- (81 samples, referred to as "ER/age low LN~~") 
and LN+ (75 samples, referred to as "ER/age low LN+") group. 

[00250] Within the "ER/age low LN~" group, a group of genes was identified that was 
uniformly co-regulated, and which correlated with the outcome. Leave-one-out cross- 
validation (including feature selection) yielded an odds ratio of 21.4 (95% CI: 6.0-76.5) and 5 
year odds ratio of 29.2 (95% CI: 6.7-126.3). This group of genes is also enriched for 
individual biological functions (see below). 

[00251] For the "ER/age low LN+" subset, an informative set of genes {see Table 4) was 
obtained after exclusion of several samples from older individuals having low ER levels. 
These samples are indicated in FIG. 2A as those lying below the dashed line (approximated 
as ER < 0.1*(age-50). 56 samples remained after the exclusion. This sample set allowed the 
identification of a group of genes with a highly homogeneous pattern that is useful for 
prognosis (overall odds ratio: 11.4 (2.5-50.9), 5 year odds ratio: 15.0 (2.6-87.6)). This 
suggests again that ER vs. age is an important combination for stratifying breast cancer 
patients. The reporter genes involved in this classifier also correlated with the clinical 
measure of the degree of lymphocytic infiltration (data not shown). The prediction in this 
group was not as strong as other positive groups, which may indicate the primary tumor 
carries weaker information about the metastases for this group of patients, and the metastases 
may be started from or influenced by tumors already in lymph nodes. 

[00252] In the ER~ branch, because a portion of the samples are "BRCA 1 -like," it is natural 
to divide the samples into "BRCA1 -like" and "sporadic like". To perform the classification, 
the i?i?C/li/sporadic tumor type classifier described in Roberts et aL 9 "Diagnosis and 
Prognosis of Breast Cancer Patients," International Publication No. WO 02/103320, which is 
hereby incorporated by reference in its entirety, to segregate the ER - cohort samples. 52 out 
of the 72 ER~~ samples were found to be "sporadic like" and 20 were found to be "BRCA1- 
like". Interestingly, the "sporadic like" group was enriched for erbb2 mutations (data not 
shown). 
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[00253] Within the "ER~~, sporadic" group, no homogeneous prognosis pattern was 
identified; however, 20 genes were identified that are highly predictive of the tumor outcome 
(see Table 2). Leave-one-out cross-validation including feature selection yielded an odds 
ratio of 7.4 (95% CI 2.2-25.0) and 5 year odds ratio 9.6 (2.5 - 37.6). This result represents a 
significant improvement in prognosis compared to the previously-identified 70 gene 
prognosis classifier (see Roberts etaL, International Publication No. WO 02/103320; van f t 
Veer et al, Nature 415:530 (2002)) which has no within-group prognostic power for the ER~ 
patient subset. The fact that 20 genes predict outcome and that there is no homogeneous (and 
apparent biological) pattern in this group probably indicates multiple mechanisms of 
metastasis in this group. Gene annotation indicates that genes included may be involved in 
invasion, energy metabolism and other functions. 

[00254] For the "ER~ BRCA1 -like" group, we added 14 BRCA1 mutation carrier samples 
from a previous study were added to increase the number of samples. Those 14 extra 
samples also satisfied the following selection criteria: ER negative and age less than 55 years. 
The leave-one-out cross validation process identified 10 genes that are predictive of final 
outcomes. The overall odds ratio is 14.7 (95% CI: 2.3-92.1) and the 5 year odds ratio is 24.0 
(95% CI: 2.0-282.7). 

[00255] Because no homogeneous gene expression patterns were found in ER~~ branch, the 
predictive power of those genes was further validated. One means of further validation was 
to review the different classifier gene sets for biological interpretations and to identify genes 
within each classifier that gave indications as to the origins of the tumors. 
[00256] The "ER+, ER/AGE high" group yielded a classifier highly enriched for cell cycle 
genes with both Gl/S and G2/M phases represented. In this group, over-expression of 46 of 
the 50 genes was associated with disease progression including all the known cell cycle 
genes. This is consistent with rapid growth being the determinant of metastatic potential. 
Four genes in this classifier were anti-correlated with outcome and cell cycle. One of these 
genes encodes follistatin, which binds to and inhibits activin and other members of the TGFp 
family (Lin et aL, Reproduction 126:133 (2003)), the members of which have many 
functions, including growth stimulation. Tumor grade also accurately predicted metastatic 
potential in this group (overall odds ratio: 5.9, 95% CI: 2.0-1 8.0, 5 year odds ratio: 12.5, 95% 
CI: 2.6-59.3) and was also correlated with the expression level of these genes, which is 
consistent with rate of growth being the primary determinant of disease progression. This set 
of genes had a significantly lower correlation with outcome in the other patient subsets, even 
though coordinate and similarly variable expression was seen. For example, many tumors in 
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the "ER~~, sporadic" group had high cell cycle and low FST expression, but the expression of 
these genes in these groups was minimally correlated with outcome, indicating that growth 
was not the primary determinant of outcome here (see FIGS. 8 A and 8B). 
[00257] The ER+, ER/AGE low, LN~ group yielded a classifier rich in both genes for 
glycolytic enzymes (12 of 56) and genes induced by hypoxia and/or angiogenesis (14 of 56) 
with 5 genes falling into both categories. These genes were positively correlated with poor 
outcome, implying that energy metabolism (glycolysis), angiogenesis and adaptation to 
hypoxia were critical pathways in this subgroup of tumors. None of these genes appeared in 
the classifiers for the other patient subsets, and there was a much reduced predictive value of 
these genes in the other tumors, even though coordinate and similarly variable expression was 
seen (see FIG. 8C and 8D). 

[00258] The implication of the above analyses is that certain well known functions (growth, 
angiogenesis, energy metabolism) are important in certain tumor types and not in others, and 
therefore therapies that target these functions will be likely be similarly effective in some 
tumor subgroups and not in others. For example therapies that target cell cycle progression, 
such as taxol or the vinca alkaloids, may be optimally effective in the ER+, ER/AGE high 
group, where overexpression of cell cycle genes predominates in the classifier. In contrast, 
tumor subgroups in which variation in cell cycle expression is not correlated with outcome 
may be less sensitive to taxol or the vinca alkaloids. 

[00259] The "comprehensive prognosis" approach significantly improved the prediction error 
rate when compared with 70 gene classifier (Table 7). To make the comparison fair, we 
listed two sets of results from the 70 gene classifier. The first results from the use of the 
same threshold applied to all the patient subsets (threshold previously optimized for false 
negative rate); the second one results from the use of a threshold optimized for each patient 
subset (optimized for average error rate). The comprehensive approach lowered the error rate 
by at least 6%. 

Table 7. Average error rate for the patient subset approach compared with the previously- 



described 70 gene classifier. 



Prognosis method 


over all error rate 


5 year error rate 


70 gene, fix thresh 


30.90% 


25.70% 


70 gene, opt thresh 


28.60% 


27.60% 


Comprehensive 


21.50% 


19.30% 



[00260] Fix thresh: use of a fixed threshold in the classifier as previously determined. 
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[00261] Opt threshold: use of a threshold optimized for each sub-group. For the "ER/Age 
low, LN+" subgroup, 56 samples used for developing the classifier were included here, 
resulted in 306 samples in total. 
[00262] 

7. REFERENCES CITED 

[00263] All references cited herein are incorporated herein by reference in their entirety and 
for all purposes to the same extent as if each individual publication or patent or patent 
application was specifically and individually indicated to be incorporated by reference in its 
entirety for all purposes. 

[00264] Many modifications and variations of the present invention can be made without 
departing from its spirit and scope, as will be apparent to those skilled in the art. The specific 
embodiments described herein are offered by way of example only, and the invention is to be 
limited only by the terms of the appended claims along with the full scope of equivalents to 
which such claims are entitled. 
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